Open
Description
User missing values in "haven_labelled" variables are mishandled in a number of places, including regressions of, or behavior similar to, several previous issues.
Data for reprex:
vals <- c(1, 2, -9)
dat <- data.frame(
num_na_vals = haven::labelled_spss(vals, na_values = -9),
num_na_range = haven::labelled_spss(vals, na_range = c(-9, -1)),
cat_na_vals_unlabelled = haven::labelled_spss(
vals,
labels = c(Yes = 1, No = 2),
na_values = -9
),
cat_na_vals_labelled = haven::labelled_spss(
vals,
labels = c(Yes = 1, No = 2, Refused = -9),
na_values = -9
),
cat_na_range = haven::labelled_spss(
vals,
labels = c(Yes = 1, No = 2),
na_range = c(-9, -1)
),
txt_na_vals = haven::labelled_spss(as.character(vals), na_values = "-9")
)
cb <- cb_create_spss(dat)
# Warning message:
# ! User missing ranges will be treated as discrete user missing values.

There are no missing values in the "user_missing" attribute. This is likely the root of all the problems below.
attr(cb, "user_missing")
# list()

The overview shows behavior similar to #23, but now affecting both discrete and range user missings.
## no user missing values column
## but user missings are correctly counted towards % missing
cb
# # A tibble: 6 × 5
# name type label values missing
# <chr> <chr> <chr> <chr> <dbl>
# 1 num_na_vals numeric NA NA 0.333
# 2 num_na_range numeric NA NA 0.333
# 3 cat_na_vals_unlabelled categorical NA [1] Yes; [2] No 0.333
# 4 cat_na_vals_labelled categorical NA [1] Yes; [2] No; [-9] Refused 0.333
# 5 cat_na_range categorical NA [1] Yes; [2] No 0.333
# 6 txt_na_vals text NA NA 0.333

The categorical summary shows behavior similar to #16 (though for `detail_missing = FALSE`, problems only occur if the user missing value is labelled).
## with default `detail_missing = TRUE`:
## `is_missing` is `FALSE` for user missings, and `pct_of_valid` / `pct_of_missing`
## columns therefore aren't correct
cb_summarize_categorical(cb)
# # A tibble: 9 × 8
# name label is_missing value n pct_of_all pct_of_valid pct_of_missing
# <chr> <chr> <lgl> <chr> <int> <dbl> <dbl> <dbl>
# 1 cat_na_va… NA FALSE "[1]… 1 0.333 0.333 NA
# 2 cat_na_va… NA FALSE "[2]… 1 0.333 0.333 NA
# 3 cat_na_va… NA FALSE "[-9… 1 0.333 0.333 NA
# 4 cat_na_va… NA FALSE "[1]… 1 0.333 0.333 NA
# 5 cat_na_va… NA FALSE "[2]… 1 0.333 0.333 NA
# 6 cat_na_va… NA FALSE "[-9… 1 0.333 0.333 NA
# 7 cat_na_ra… NA FALSE "[1]… 1 0.333 0.333 NA
# 8 cat_na_ra… NA FALSE "[2]… 1 0.333 0.333 NA
## if `detail_missing = FALSE`:
## summaries for `cat_na_vals_unlabelled` and `cat_na_range` are correct.
## summary for `cat_na_vals_labelled` correctly includes "(Missing)" row with
## n = 1, but also includes "[-9] Refused" as non-missing value with n = 0.
cb_summarize_categorical(cb, detail_missing = FALSE)
# # A tibble: 10 × 6
# name label value n pct_of_all pct_of_valid
# <chr> <chr> <chr> <int> <dbl> <dbl>
# 1 cat_na_vals_unlabelled NA [1] Yes 1 0.333 0.5
# 2 cat_na_vals_unlabelled NA [2] No 1 0.333 0.5
# 3 cat_na_vals_unlabelled NA (Missing) 1 0.333 NA
# 4 cat_na_vals_labelled NA [1] Yes 1 0.333 0.5
# 5 cat_na_vals_labelled NA [2] No 1 0.333 0.5
# 6 cat_na_vals_labelled NA [-9] Refused 0 0 0
# 7 cat_na_vals_labelled NA (Missing) 1 0.333 NA
# 8 cat_na_range NA [1] Yes 1 0.333 0.5
# 9 cat_na_range NA [2] No 1 0.333 0.5
# 10 cat_na_range NA (Missing) 1 0.333 NA

In the text summary, the user missing value is `NA` instead of `"-9"`, but is otherwise handled correctly by `is_missing`, `pct_of_valid`, and `pct_of_missing`.
## with default `detail_missing = TRUE`:
## `value` is `NA` instead of `"-9"`, but otherwise OK
cb_summarize_text(cb)
# # A tibble: 3 × 9
# name label is_missing unique_n value n pct_of_all pct_of_valid
# <chr> <chr> <lgl> <int> <chr> <int> <dbl> <dbl>
# 1 txt_na_vals NA FALSE 2 1 1 0.333 0.5
# 2 txt_na_vals NA FALSE 2 2 1 0.333 0.5
# 3 txt_na_vals NA TRUE NA NA 1 0.333 NA
## if `detail_missing = FALSE`, everything seems OK
cb_summarize_text(cb, detail_missing = FALSE)
# # A tibble: 3 × 7
# name label unique_n value n pct_of_all pct_of_valid
# <chr> <chr> <int> <chr> <int> <dbl> <dbl>
# 1 txt_na_vals NA 2 1 1 0.333 0.5
# 2 txt_na_vals NA 2 2 1 0.333 0.5
# 3 txt_na_vals NA NA (Missing) 1 0.333 NA

Everything seems OK for the numeric summary.
## User missings are correctly excluded from `valid_n` and `valid_pct`
## and from the stats.
cb_summarize_numeric(cb)
# # A tibble: 2 × 13
# name label valid_n valid_pct mean SD median MAD min max range skew
# <chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 num_… NA 2 0.667 1.5 0.707 1.5 0.741 1 2 1 0
# 2 num_… NA 2 0.667 1.5 0.707 1.5 0.741 1 2 1 0
# # ℹ 1 more variable: kurt <dbl>
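
For comparison, here is a rough base R sketch of the values I would expect the categorical summary to report for `cat_na_vals_labelled` with `detail_missing = TRUE`. The helper `is_user_missing()` is hypothetical (it is not part of haven or of this package), and the sketch assumes `pct_of_valid` and `pct_of_missing` are shares within the non-missing and missing rows respectively.

```r
# Hypothetical helper (not from haven or this package): flag values declared
# as user missing via discrete values and/or an inclusive range.
is_user_missing <- function(x, na_values = NULL, na_range = NULL) {
  out <- is.na(x) | x %in% na_values
  if (!is.null(na_range)) {
    out <- out | (!is.na(x) & x >= na_range[1] & x <= na_range[2])
  }
  out
}

vals <- c(1, 2, -9)

# The discrete and range declarations from the reprex flag the same element.
miss_vals  <- is_user_missing(vals, na_values = -9)
miss_range <- is_user_missing(vals, na_range = c(-9, -1))

# Expected rows for one categorical variable (each value occurs once).
expected <- data.frame(
  value      = vals,
  is_missing = miss_vals,
  n          = 1L,
  pct_of_all = 1 / length(vals)
)
n_valid   <- sum(expected$n[!expected$is_missing])
n_missing <- sum(expected$n[expected$is_missing])

# Assumption: each percentage is computed within its own group, NA elsewhere.
expected$pct_of_valid   <- ifelse(expected$is_missing, NA_real_, expected$n / n_valid)
expected$pct_of_missing <- ifelse(expected$is_missing, expected$n / n_missing, NA_real_)

expected
```

Under these assumptions, the `-9` row would have `is_missing = TRUE`, and the two valid rows would each have `pct_of_valid = 0.5`, matching what `detail_missing = FALSE` already reports.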