update handling of global option of 'use_inf_as_na'#126
update handling of global option of 'use_inf_as_na'#126
Conversation
…inf within function
JaroCamphuijsen
left a comment
There was a problem hiding this comment.
Great that you're fixing this, because it is breaking things across other pr's as well. Also nice that we are losing the pandas import and the pd.set_option call in the global scope of this module. It's always a bit tricky to stick things in there and might lead to unexpected behavior, e.g. if you're only using one of the functions in your script and not loading the whole module, it will not execute the global command pd.set_option.
There are some minor issues with the consistency of handling np.inf and its documentation.
|
|
||
| # If there are any rows containing NULL cells, the NULL values will be filtered | ||
| if len(NaN_rows)>0: | ||
| if replace_NaN == 'remove': |
There was a problem hiding this comment.
Note that here (in the case replace_NaN == 'remove') we drop rows that contain either an by pandas na identified value (as specified here: https://pandas.pydata.org/docs/reference/api/pandas.isna.html#pandas.isna) OR np.inf (both negative and positive). This is because we use the mask that we produce ourselves in line 66, to remove the rows, it includes the np.inf values so also removes these.
However in the following cases ("zero", float|int value, "average", or "median") we use a pandas function (pd.DataFrame.isna()) which only selects the NaN values as identified by pandas.isna and not the np.inf values.
This makes the behavior of this function somewhat unclear and unpredictable. Perhaps we should completely ignore the inf values, or if we want, we can include them but should do that persistently. If you think it is still useful and a much encountered value, you could consider adding another argument to the function (regard_inf_as_na) and just do the replacement (with a DataFrame.replace({np.inf : np.nan, -np.inf : np.nan}) at the start of the function.
Do you think that would be a good solution?
| pd.set_option('mode.use_inf_as_na', True) | ||
|
|
||
| def filter_values(data_frame, | ||
| replace_NaN = 'remove', |
There was a problem hiding this comment.
I think that we could use a slight improvement in the wording of what this function does and make it more consistent throughout the function. Currently the way it handles np.inf values is not documented in the docstrings and it is not consistent for all values of the replace_NaN argument.
| data_frame : pd.dataframe | ||
| Tabular data containing variables to be evaluated with standard | ||
| column names and rows of sample data. | ||
| replace_NaN : string or float, default "remove" |
There was a problem hiding this comment.
Could we make this all lowercase instead? So replace_nan. It makes the function easier to use.
tests/test_transformation.py
Outdated
| """ | ||
| data_mod = self.data.copy() | ||
| data_mod.iloc[2,25]=np.inf | ||
| data_filter = filter_values(data_mod, |
There was a problem hiding this comment.
This currently only works if you set the replace_NaN argument to 'remove'. In al the other cases it will not produce the expected result. Or it will actually, because the filtering of inf values is not described in the docstrings.
Perhaps add a test for handling inf values with the other cases as well if we decide to keep the np.inf handling.
|
|
@jaro: good point. I tried first, but it did not work. Now, I managed to get it working with replacing inf by NaN with the command you suggest. During adapting of the testing, I realized that it is good to include handling of 'inf' values more generally. I adapted the example data now also including one inf-value. This required some adaptions in the testing and in the functions calculating average concentrations and non-zero value counts. |



The global option 'use_inf_as_na' is deprecated and will be removed in a future version of pandas (from 3.0 onwards).
Thus, modify the function "filter_values()" in transformation.py to not need the global conversation and handling detection of 'inf' values (similar to NaN) differently.