Skip to content

Update attrition#363

Draft
ablack3 wants to merge 2 commits intodevelopfrom
issue_362_attrition
Draft

Update attrition#363
ablack3 wants to merge 2 commits intodevelopfrom
issue_362_attrition

Conversation

@ablack3
Copy link
Copy Markdown
Collaborator

@ablack3 ablack3 commented Mar 31, 2026

update attrition recording. preserve person id. remove inequality full join in favor of inner join followed by filter.

…l join in favor of inner join followed by filter.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates how attrition is recorded and how cohort/event records are joined when constructing treatment histories, with the intent to preserve the original person identifier and to replace a non-equi full_join() with an equi-inner_join() plus post-filtering.

Changes:

  • Remove subject_id_origin handling and treat subject_id/personId as the preserved original ID (character).
  • Replace non-equi full_join() between events/targets with inner_join(personId) followed by a date-window filter(), and add a new attrition entry for “no event records”.
  • Re-number attrition reason_ids across pathway construction steps and adjust cohort table assertions accordingly.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
R/TreatmentPatterns-package.R Removes subject_id_origin from declared global variables.
R/exportPatientLevel.R Stops adding subject_id_origin to patient-level export; returns table unchanged.
R/constructPathways.R Reworks event/target join strategy and updates attrition reason IDs (adds a new attrition step).
R/computePathways.R Updates attrition reason ID and changes subject_id type assertion to character.
R/CDMInterface.R Casts subject_id to character, changes person join key casting, and adds attrition entries after cohort filtering steps.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +264 to +274
# Attrition: persons with no event records (may appear as 'None' paths if nonePaths = TRUE)
attrCounts <- fetchAttritionCounts(andromeda, "treatmentHistoryJoined")
appendAttrition(
toAdd = data.frame(
number_records = attrCounts$nRecords,
number_subjects = attrCounts$nSubjects,
reason_id = 4,
reason = "Subjects with no event records (may appear as 'None' paths)"
),
andromeda = andromeda
)
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new attrition step for “Subjects with no event records” is computed from treatmentHistoryJoined, but that table is created via inner_join(). Subjects with zero event records will not appear in treatmentHistoryJoined, so this attrition count cannot be correct. Compute this attrition from the target cohort population instead (e.g., target persons anti-joined against event persons), and/or compute “no events within window” after the window filter.

Copilot uses AI. Check for mistakes.
Comment on lines +255 to +281
# Step 1: Equi-join on person to match events to targets
andromeda$treatmentHistoryJoined <- dplyr::inner_join(
x = andromeda$eventCohorts,
y = andromeda$targetCohorts,
by = dplyr::join_by(
personId == personId,
subject_id_origin == subject_id_origin,
y$indexDate <= x$startDate,
x$startDate <= y$endDate,
personId == personId
), suffix = c("Event", "Target")
)

# Attrition: persons with no event records (may appear as 'None' paths if nonePaths = TRUE)
attrCounts <- fetchAttritionCounts(andromeda, "treatmentHistoryJoined")
appendAttrition(
toAdd = data.frame(
number_records = attrCounts$nRecords,
number_subjects = attrCounts$nSubjects,
reason_id = 4,
reason = "Subjects with no event records (may appear as 'None' paths)"
),
andromeda = andromeda
)

# Step 2: Filter to events within the time window
andromeda[[sprintf("cohortTable_%s", targetCohortId)]] <- andromeda$treatmentHistoryJoined %>%
dplyr::filter(
.data$indexDate <= .data$startDateEvent,
.data$startDateEvent <= .data$endDateTarget
)
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Joining targets/events on personId only and then filtering by dates can create a very large intermediate (cartesian within person when there are multiple targets and many events), which can severely impact runtime and memory on real CDM-scale data. Prefer pushing the date constraints into the join (non-equi join) or otherwise reduce the join size before materializing the combined table.

Copilot uses AI. Check for mistakes.
Comment on lines 255 to 263
# Step 1: Equi-join on person to match events to targets
andromeda$treatmentHistoryJoined <- dplyr::inner_join(
x = andromeda$eventCohorts,
y = andromeda$targetCohorts,
by = dplyr::join_by(
personId == personId,
subject_id_origin == subject_id_origin,
y$indexDate <= x$startDate,
x$startDate <= y$endDate,
personId == personId
), suffix = c("Event", "Target")
)

Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

andromeda$treatmentHistoryJoined is persisted in the returned Andromeda even though it is only used as an intermediate for the window filter. This can significantly bloat the returned object (and disk usage) for large runs. Consider removing it after cohortTable_<target> is created, or storing it only behind a debug/diagnostics option.

Copilot uses AI. Check for mistakes.
R/CDMInterface.R Outdated
number_records = sum(n),
number_subjects = length(n),
reason_id = 2,
reason = "Persons with no record in target cohort",
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The attrition reason says “Persons with no record in target cohort”, but the preceding filter keeps only subjects that DO have at least one target cohort record. This makes the attrition output misleading; rename the reason to reflect the step (e.g., “After filtering to persons with >=1 target cohort record”).

Suggested change
reason = "Persons with no record in target cohort",
reason = "After filtering to persons with >=1 target cohort record",

Copilot uses AI. Check for mistakes.
R/CDMInterface.R Outdated
Comment on lines 108 to 115
# Filter to persons with at least one target cohort record
targetId <- as.numeric(targetCohortIds)

andromeda[[andromedaTableName]] <- andromeda[[andromedaTableName]] %>%
dplyr::mutate(cohort_definition_id = as.numeric(.data$cohort_definition_id)) %>%
dplyr::group_by(.data$subject_id) %>%
dplyr::filter(any(.data$cohort_definition_id %in% targetId, na.rm = TRUE)) %>%
dplyr::ungroup()
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The target-cohort presence filter is applied before the minEraDuration filter. After removing short eras, some subjects may no longer have any target records, which contradicts the intent of the target filter and can skew attrition reporting. Apply minEraDuration filtering before the target filter or re-run the target filter after minEraDuration.

Copilot uses AI. Check for mistakes.
assertions <- checkmate::makeAssertCollection()
checkmate::assertIntegerish(cohortTableHead$cohort_definition_id, add = assertions)
checkmate::assertIntegerish(cohortTableHead$subject_id, add = assertions)
checkmate::assertCharacter(cohortTableHead$subject_id, add = assertions)
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing subject_id validation to character implies personId is now character throughout outputs. This is a breaking API change for downstream consumers and also conflicts with existing tests that expect numeric/integer personId (e.g., tests/testthat/test-computePathways.R around the filterTreatments test). Either preserve numeric personId (and add a separate origin/id column) or update the public contract + tests accordingly.

Suggested change
checkmate::assertCharacter(cohortTableHead$subject_id, add = assertions)
checkmate::assertIntegerish(cohortTableHead$subject_id, add = assertions)

Copilot uses AI. Check for mistakes.
dplyr::select("personId", "subject_id_origin") %>%
dplyr::inner_join(table, by = dplyr::join_by(personId == personId)) %>%
dplyr::distinct()
# personId already contains the original subject_id (as character)
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addOriginId() no longer adds the original ID column to the exported patient-level CSV, which changes the treatment_history.csv schema (and will likely change ncol() expectations in tests/testthat/test-exportPatientLevel.R). If the intent is to drop subject_id_origin, update the export contract/tests; otherwise keep exporting a dedicated origin identifier column separate from personId.

Suggested change
# personId already contains the original subject_id (as character)
# Ensure a dedicated origin identifier column is present for export.
# personId already contains the original subject_id (as character),
# so if subject_id_origin is missing we create it from personId.
if (!"subject_id_origin" %in% colnames(table)) {
table <- dplyr::mutate(table, subject_id_origin = .data$personId)
}

Copilot uses AI. Check for mistakes.
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 1, 2026

Codecov Report

❌ Patch coverage is 98.11321% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
R/constructPathways.R 96.29% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@mvankessel-EMC
Copy link
Copy Markdown
Collaborator

mvankessel-EMC commented Apr 1, 2026

@ablack3 I'm fully behind the extra attrition steps. Could you split that out into a separate PR? I still have to think about the other changes. Edit; also you could you let Copilot re-review after the changes you made on its initial feedback?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants