Conversation
…l join in favor of inner join followed by filter.
There was a problem hiding this comment.
Pull request overview
This PR updates how attrition is recorded and how cohort/event records are joined when constructing treatment histories, with the intent to preserve the original person identifier and to replace a non-equi full_join() with an equi-inner_join() plus post-filtering.
Changes:
- Remove
subject_id_originhandling and treatsubject_id/personIdas the preserved original ID (character). - Replace non-equi
full_join()between events/targets withinner_join(personId)followed by a date-windowfilter(), and add a new attrition entry for “no event records”. - Re-number attrition
reason_ids across pathway construction steps and adjust cohort table assertions accordingly.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| R/TreatmentPatterns-package.R | Removes subject_id_origin from declared global variables. |
| R/exportPatientLevel.R | Stops adding subject_id_origin to patient-level export; returns table unchanged. |
| R/constructPathways.R | Reworks event/target join strategy and updates attrition reason IDs (adds a new attrition step). |
| R/computePathways.R | Updates attrition reason ID and changes subject_id type assertion to character. |
| R/CDMInterface.R | Casts subject_id to character, changes person join key casting, and adds attrition entries after cohort filtering steps. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
R/constructPathways.R
Outdated
| # Attrition: persons with no event records (may appear as 'None' paths if nonePaths = TRUE) | ||
| attrCounts <- fetchAttritionCounts(andromeda, "treatmentHistoryJoined") | ||
| appendAttrition( | ||
| toAdd = data.frame( | ||
| number_records = attrCounts$nRecords, | ||
| number_subjects = attrCounts$nSubjects, | ||
| reason_id = 4, | ||
| reason = "Subjects with no event records (may appear as 'None' paths)" | ||
| ), | ||
| andromeda = andromeda | ||
| ) |
There was a problem hiding this comment.
The new attrition step for “Subjects with no event records” is computed from treatmentHistoryJoined, but that table is created via inner_join(). Subjects with zero event records will not appear in treatmentHistoryJoined, so this attrition count cannot be correct. Compute this attrition from the target cohort population instead (e.g., target persons anti-joined against event persons), and/or compute “no events within window” after the window filter.
R/constructPathways.R
Outdated
| # Step 1: Equi-join on person to match events to targets | ||
| andromeda$treatmentHistoryJoined <- dplyr::inner_join( | ||
| x = andromeda$eventCohorts, | ||
| y = andromeda$targetCohorts, | ||
| by = dplyr::join_by( | ||
| personId == personId, | ||
| subject_id_origin == subject_id_origin, | ||
| y$indexDate <= x$startDate, | ||
| x$startDate <= y$endDate, | ||
| personId == personId | ||
| ), suffix = c("Event", "Target") | ||
| ) | ||
|
|
||
| # Attrition: persons with no event records (may appear as 'None' paths if nonePaths = TRUE) | ||
| attrCounts <- fetchAttritionCounts(andromeda, "treatmentHistoryJoined") | ||
| appendAttrition( | ||
| toAdd = data.frame( | ||
| number_records = attrCounts$nRecords, | ||
| number_subjects = attrCounts$nSubjects, | ||
| reason_id = 4, | ||
| reason = "Subjects with no event records (may appear as 'None' paths)" | ||
| ), | ||
| andromeda = andromeda | ||
| ) | ||
|
|
||
| # Step 2: Filter to events within the time window | ||
| andromeda[[sprintf("cohortTable_%s", targetCohortId)]] <- andromeda$treatmentHistoryJoined %>% | ||
| dplyr::filter( | ||
| .data$indexDate <= .data$startDateEvent, | ||
| .data$startDateEvent <= .data$endDateTarget | ||
| ) |
There was a problem hiding this comment.
Joining targets/events on personId only and then filtering by dates can create a very large intermediate (cartesian within person when there are multiple targets and many events), which can severely impact runtime and memory on real CDM-scale data. Prefer pushing the date constraints into the join (non-equi join) or otherwise reduce the join size before materializing the combined table.
R/constructPathways.R
Outdated
| # Step 1: Equi-join on person to match events to targets | ||
| andromeda$treatmentHistoryJoined <- dplyr::inner_join( | ||
| x = andromeda$eventCohorts, | ||
| y = andromeda$targetCohorts, | ||
| by = dplyr::join_by( | ||
| personId == personId, | ||
| subject_id_origin == subject_id_origin, | ||
| y$indexDate <= x$startDate, | ||
| x$startDate <= y$endDate, | ||
| personId == personId | ||
| ), suffix = c("Event", "Target") | ||
| ) | ||
|
|
There was a problem hiding this comment.
andromeda$treatmentHistoryJoined is persisted in the returned Andromeda even though it is only used as an intermediate for the window filter. This can significantly bloat the returned object (and disk usage) for large runs. Consider removing it after cohortTable_<target> is created, or storing it only behind a debug/diagnostics option.
R/CDMInterface.R
Outdated
| number_records = sum(n), | ||
| number_subjects = length(n), | ||
| reason_id = 2, | ||
| reason = "Persons with no record in target cohort", |
There was a problem hiding this comment.
The attrition reason says “Persons with no record in target cohort”, but the preceding filter keeps only subjects that DO have at least one target cohort record. This makes the attrition output misleading; rename the reason to reflect the step (e.g., “After filtering to persons with >=1 target cohort record”).
| reason = "Persons with no record in target cohort", | |
| reason = "After filtering to persons with >=1 target cohort record", |
R/CDMInterface.R
Outdated
| # Filter to persons with at least one target cohort record | ||
| targetId <- as.numeric(targetCohortIds) | ||
|
|
||
| andromeda[[andromedaTableName]] <- andromeda[[andromedaTableName]] %>% | ||
| dplyr::mutate(cohort_definition_id = as.numeric(.data$cohort_definition_id)) %>% | ||
| dplyr::group_by(.data$subject_id) %>% | ||
| dplyr::filter(any(.data$cohort_definition_id %in% targetId, na.rm = TRUE)) %>% | ||
| dplyr::ungroup() |
There was a problem hiding this comment.
The target-cohort presence filter is applied before the minEraDuration filter. After removing short eras, some subjects may no longer have any target records, which contradicts the intent of the target filter and can skew attrition reporting. Apply minEraDuration filtering before the target filter or re-run the target filter after minEraDuration.
R/computePathways.R
Outdated
| assertions <- checkmate::makeAssertCollection() | ||
| checkmate::assertIntegerish(cohortTableHead$cohort_definition_id, add = assertions) | ||
| checkmate::assertIntegerish(cohortTableHead$subject_id, add = assertions) | ||
| checkmate::assertCharacter(cohortTableHead$subject_id, add = assertions) |
There was a problem hiding this comment.
Changing subject_id validation to character implies personId is now character throughout outputs. This is a breaking API change for downstream consumers and also conflicts with existing tests that expect numeric/integer personId (e.g., tests/testthat/test-computePathways.R around the filterTreatments test). Either preserve numeric personId (and add a separate origin/id column) or update the public contract + tests accordingly.
| checkmate::assertCharacter(cohortTableHead$subject_id, add = assertions) | |
| checkmate::assertIntegerish(cohortTableHead$subject_id, add = assertions) |
R/exportPatientLevel.R
Outdated
| dplyr::select("personId", "subject_id_origin") %>% | ||
| dplyr::inner_join(table, by = dplyr::join_by(personId == personId)) %>% | ||
| dplyr::distinct() | ||
| # personId already contains the original subject_id (as character) |
There was a problem hiding this comment.
addOriginId() no longer adds the original ID column to the exported patient-level CSV, which changes the treatment_history.csv schema (and will likely change ncol() expectations in tests/testthat/test-exportPatientLevel.R). If the intent is to drop subject_id_origin, update the export contract/tests; otherwise keep exporting a dedicated origin identifier column separate from personId.
| # personId already contains the original subject_id (as character) | |
| # Ensure a dedicated origin identifier column is present for export. | |
| # personId already contains the original subject_id (as character), | |
| # so if subject_id_origin is missing we create it from personId. | |
| if (!"subject_id_origin" %in% colnames(table)) { | |
| table <- dplyr::mutate(table, subject_id_origin = .data$personId) | |
| } |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
@ablack3 I'm fully behind the extra attrition steps. Could you split that out into a separate PR? I still have to think about the other changes. Edit; also you could you let Copilot re-review after the changes you made on its initial feedback? |
update attrition recording. preserve person id. remove inequality full join in favor of inner join followed by filter.