[PULL REQUEST] Updating Mortality Calculations#104
Conversation
Pull request overview
This pull request updates the mortality calculation methodology for the Cohort Component Model (CCM) by implementing procedures from the Mortality Prediction Project (MPP). The changes primarily involve removing outdated CDC mortality data files from 2010-2020 and replacing the calculation approach with new CDC-related functions and splining techniques from the MPP.
Changes:
- Removal of legacy CDC mortality data files for various demographic groups across multiple years (2010-2020)
- Implementation of new mortality calculation procedures following the Mortality Prediction Project methodology
- Integration of CDC-related functions and splining techniques into existing CCM functions
GregorSchroeder
left a comment
This is initial feedback. I think once these issues are resolved we can move into simplifying the whole process as it seems overly complex at the moment. With all these files and transformations, I think it would behoove us to move towards loading in mortality for the year of interest only as opposed to doing a mass loading and then subsequent filtering in get_death_rates.
data folder
- We only need data back to 2010, remove all the pre-2010 datasets. Calling a folder "2018_2023" when it only contains 2022+ is confusing. Also, what is going to happen to that folder name when it includes 2024? I would suggest either not differentiating by product in separate folders, because they don't overlap in terms of uploaded data, or using more generic folder names.
- Need to add a `.gitattributes` file so we aren't doing full track changes for all of these mortality rate files. See https://github.com/SANDAG/CA-DOF/blob/main/.gitattributes.
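For reference, a `.gitattributes` along these lines would stop line-level diffs on the bulky data files (the path patterns are illustrative, not the repo's actual layout):

```
# Treat bulky mortality data files as binary so git skips line-by-line diffs
# (paths are illustrative placeholders)
data/**/*.csv -diff
data/**/*.txt -diff
```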
parsing file names
- Why is there a mix of `.csv` and `.txt` files that necessitates the use of `get_file_separator`?
- Do we need to have all these abbreviations and mappings in `parse_filename` or could we just use full names and mappings? Why separate the labels and maps in `valid_config`, couldn't it just use `map` and check for valid labels? `[str(y) for y in range(1999, 2024)]` introduces a year-restriction that should not be in place here

parsing and transforming files
- replace usages of `.query` with `.loc`
- is `parse_not_stated` actually using the input dataframe it takes in? it looks like it just overwrites it immediately?
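The `.query` → `.loc` swap in a nutshell (the frame and column values below are made up for illustration, not the repo's actual data):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "age": ["0", "1", "NS", "2"],
        "deaths": [5, 3, 1, 4],
    }
)

# String-based .query expression: parsed at runtime, harder to lint
via_query = df.query("age != 'NS'")

# Equivalent boolean-mask .loc selection: no string parsing, and it
# handles column names with spaces (e.g. "hispanic origin") cleanly
via_loc = df.loc[df["age"] != "NS"]

assert via_query.equals(via_loc)
```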
CDC underwent an update in which they no longer allow […]
GregorSchroeder
left a comment
Why was the use of the logger removed and substituted with print statements? We want to write warnings, messages, etc. to the log file, not to print statements.
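A minimal sketch of the logger-based pattern being asked for (the logger name, function, and message are illustrative, not the repo's actual setup):

```python
import logging

# Module-level logger; handlers are configured once at startup, e.g.
# logging.basicConfig(filename="ccm.log") -- names here are illustrative
logger = logging.getLogger("ccm.mortality")


def check_rates(rates):
    """Flag suspect mortality rates through the logger, not print()."""
    for r in rates:
        if r < 0 or r > 1:
            # Goes to whatever handlers are configured (log file, console, ...);
            # print() would bypass the log entirely
            logger.warning("Mortality rate %s outside [0, 1]", r)
```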
I think there is a lot of complication right now in the mortality code and most of it might be due to the file/folder structure. I think moving back to a simpler folder structure of just deaths with years as subfolders will 1) provide more organization and clarity as to what data sources are used to calculate mortality rates for a given year and 2) simplify the code allowing us to pass year as a parameter and fully isolate our calculations within a single year. Your filenames are doing all the heavy lifting in terms of identifying product and geographic grain so we don't need subfolders to provide further identification. I was thinking of pseudocode as follows under this new folder structure.
- Create a function `load_cdc_wonder` that takes in a single file path
  - `validate_file_name`: check the file name is consistent with the pattern and return the implied metadata
  - Load the file and check against the implied metadata
  - `inflate_deaths`: load the Stated/Not Stated and apply the inflation factor to deaths
- Wrapper function `parse_cdc_wonder` that takes a given year and calls the above function for each geography level
  - `rate_substitution`: apply the substitution methodology
  - `smooth_rates`: apply smoothing
  - If a given year's product is the old CDC WONDER (can tell from file names or even just a hardcoded year value) then take the "All" races data and duplicate it for "NHPI" and "Two or More"
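The `validate_file_name` step could be a small regex check that returns the implied metadata; the naming pattern below (product_geography_year) is a guess at the convention, not the repo's actual scheme:

```python
import re
from pathlib import Path

# Hypothetical pattern, e.g. "wonder_county_2022.txt" -> product/geography/year
_NAME_RE = re.compile(
    r"^(?P<product>[a-z]+)_(?P<geography>county|state|national)_(?P<year>\d{4})\.(csv|txt)$"
)


def validate_file_name(path):
    """Check the file name against the expected pattern; return implied metadata."""
    match = _NAME_RE.match(Path(path).name)
    if match is None:
        raise ValueError(f"Unexpected mortality file name: {path}")
    meta = match.groupdict()
    meta["year"] = int(meta["year"])  # year arrives as a string from the regex
    return meta
```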
GregorSchroeder
left a comment
Let's concentrate on naming, parsing, and validating the files then making sure we are using the metadata and a single year to drive everything.
| "ASIAN": "Asian", | ||
| "BAA": "Black or African American", | ||
| "MOR": "More than one race", | ||
| "HIS": "Hispanic", |
There should be no Hispanic race.
| "NHPI": "Native Hawaiian or Other Pacific Islander", | ||
| "WH": "White", | ||
| "ALL": "All", | ||
| "NA": "Not Available", |
At your direction we are not exporting the NAs as they are all 0s and do not need to be incorporated into Not Stated inflation factor. Remove.
| f"Metadata hispanic: '{metadata['hispanic']}' does not match file contents." | ||
| ) | ||
|
|
||
| # Race validation is done explicitly excepting for All and Hispanic |
there is no Hispanic race; remove the logic
        )
    )

    # Separate data by status (check if age_group is "SYA NS")
what about the hispanic not stated?
    # Use unified load_cdc_wonder function
    df = load_cdc_wonder(file_path)
    if not df.empty:
        location = df["location"].iloc[0]
just use the metadata no need to set a new variable
    # Population for 2022+ county level is suppressed and will be replaced
    # with yearly CCM population estimates which require annual death counts to
    # calculate rates
    deaths=lambda x: np.where(
instead of year and location shouldn't we just use the metadata product? this is consistent with later usage in this file
        and location == "San Diego County"
        and pop_df is not None
    ):
        pop_df_with_year = pop_df.copy()
just do it all in the df assignment
why do the data types need to be set in pop_df?
    stated_df = df.loc[
        lambda x: (x["hispanic origin"] != "Not Stated")
        & (x["age"] != "NS")
    ]
shouldn't this already be filtered in the load_cdc_wonder?
        pd.DataFrame: A single DataFrame for the specified year with data for
        ages 0-99 from the CDC.
    """
I would simply declare a list and append to it, then use pd.concat later.
We have the location field, so there's no need for a dictionary, and we are using year-specific loading, so there is no need for product specification, although that is also a field.
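The declare-a-list-then-`pd.concat` pattern, sketched with a stand-in loader (nothing here is the repo's actual code; `loader` stands in for `load_cdc_wonder`):

```python
import pandas as pd


def load_all(paths, loader):
    """Collect per-file frames in a list, then concatenate once at the end.

    Appending to a Python list and calling pd.concat a single time is much
    cheaper than concatenating inside the loop, which copies every iteration.
    """
    frames = []
    for path in paths:
        df = loader(path)
        if not df.empty:
            frames.append(df)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```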
    county = pd.concat(county_dfs, ignore_index=True)
    state = pd.concat(state_dfs, ignore_index=True)
    national = pd.concat(national_dfs, ignore_index=True)
    substituted_df = merge_geographies(county, state, national)
Instead of a function, couldn't we take the single combined dataframe with the location field, put the rates in wide format by location, and then calculate our rate using a conditional is-null check? You had said previously that the substitution methodology is sufficient, so we can remove all the rate shifting and imputation logic.
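One way to sketch the wide-format substitution being proposed (the column and location values are assumptions, not the repo's actual schema):

```python
import pandas as pd


def substitute_rates(df):
    """Pivot rates wide by location and backfill missing county rates.

    Assumes a long frame with columns age, location, rate, where location
    is one of 'county', 'state', 'national'. A null (suppressed) county
    rate falls back to the state rate, then the national rate.
    """
    wide = df.pivot(index="age", columns="location", values="rate")
    wide["substituted"] = (
        wide["county"].fillna(wide["state"]).fillna(wide["national"])
    )
    return wide
```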
Describe this pull request. What changes are being made?
This pull request alters how mortality is calculated for the CCM. The PR mainly follows the procedures from the Mortality Prediction Project (MPP), described in Issue 95. It implements the previous CDC-related functions and splining techniques from the MPP, in addition to amending existing CCM functions.
A successful demo can be seen in `run_id = 4` in `[CohortComponentModel].[outputs].[rates]`.

What issues does this pull request address?
closes #95
Additional context
Refer to the Mortality Prediction Project and Issue 95 for further detail.