user missing values defined in haven_labelled vectors not consistently recognized #32

@ccsarapas

This shows up in several places, including regressions of (or behavior similar to) several previous issues involving "haven_labelled" user missings.

Data for reprex:

vals <- c(1, 2, -9)

dat <- data.frame(
  num_na_vals = haven::labelled_spss(vals, na_values = -9),
  num_na_range = haven::labelled_spss(vals, na_range = c(-9, -1)),
  cat_na_vals_unlabelled = haven::labelled_spss(
    vals, 
    labels = c(Yes = 1, No = 2),
    na_values = -9
  ),
  cat_na_vals_labelled = haven::labelled_spss(
    vals, 
    labels = c(Yes = 1, No = 2, Refused = -9),
    na_values = -9
  ),
  cat_na_range = haven::labelled_spss(
    vals, 
    labels = c(Yes = 1, No = 2),
    na_range = c(-9, -1)
  ),
  txt_na_vals = haven::labelled_spss(as.character(vals), na_values = "-9")
)

cb <- cb_create_spss(dat)
# Warning message:
# ! User missing ranges will be treated as discrete user missing values.

The "user_missing" attribute contains no missing values. This is likely the root of all the problems below.

attr(cb, "user_missing")
# list()
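
For comparison, here is a minimal sketch showing that the user-missing definitions are present on the haven vectors themselves, so the information is available at the point where the codebook is built. It uses only haven and base R and does not reflect the package's internals; `x_vals` and `x_range` are illustrative names that are not part of the reprex above.

```r
library(haven)

vals <- c(1, 2, -9)
x_vals <- labelled_spss(vals, na_values = -9)
x_range <- labelled_spss(vals, na_range = c(-9, -1))

# haven stores user-missing definitions as plain attributes on each vector
attr(x_vals, "na_values")
# [1] -9
attr(x_range, "na_range")
# [1] -9 -1

# is.na() on a "haven_labelled_spss" vector also flags user-missing values,
# whether they are defined as discrete values or as a range
is.na(x_vals)
# [1] FALSE FALSE  TRUE
is.na(x_range)
# [1] FALSE FALSE  TRUE
```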

The overview shows behavior similar to #23, but now affecting both discrete and range user missings.

## no user missing values column
## but user missings are correctly counted towards % missing
cb
# # A tibble: 6 × 5
#   name                   type        label values                        missing
#   <chr>                  <chr>       <chr> <chr>                           <dbl>
# 1 num_na_vals            numeric     NA    NA                              0.333
# 2 num_na_range           numeric     NA    NA                              0.333
# 3 cat_na_vals_unlabelled categorical NA    [1] Yes; [2] No                 0.333
# 4 cat_na_vals_labelled   categorical NA    [1] Yes; [2] No; [-9] Refused   0.333
# 5 cat_na_range           categorical NA    [1] Yes; [2] No                 0.333
# 6 txt_na_vals            text        NA    NA                              0.333

The categorical summary shows behavior similar to #16 (though for detail_missing = FALSE, problems only occur if the user missing value is labelled).

## with default `detail_missing = TRUE`:
## `is_missing` is `FALSE` for user missings, and `pct_of_valid` / `pct_of_missing` 
## columns therefore aren't correct
cb_summarize_categorical(cb)
# # A tibble: 9 × 8
#   name       label is_missing value     n pct_of_all pct_of_valid pct_of_missing
#   <chr>      <chr> <lgl>      <chr> <int>      <dbl>        <dbl>          <dbl>
# 1 cat_na_va… NA    FALSE      "[1]…     1      0.333        0.333             NA
# 2 cat_na_va… NA    FALSE      "[2]…     1      0.333        0.333             NA
# 3 cat_na_va… NA    FALSE      "[-9…     1      0.333        0.333             NA
# 4 cat_na_va… NA    FALSE      "[1]…     1      0.333        0.333             NA
# 5 cat_na_va… NA    FALSE      "[2]…     1      0.333        0.333             NA
# 6 cat_na_va… NA    FALSE      "[-9…     1      0.333        0.333             NA
# 7 cat_na_ra… NA    FALSE      "[1]…     1      0.333        0.333             NA
# 8 cat_na_ra… NA    FALSE      "[2]…     1      0.333        0.333             NA

## if `detail_missing = FALSE`:
## summaries for `cat_na_vals_unlabelled` and `cat_na_range` are correct. 
## summary for `cat_na_vals_labelled` correctly includes "(Missing)" row with 
## n = 1, but also includes "[-9] Refused" as non-missing value with n = 0.
cb_summarize_categorical(cb, detail_missing = FALSE)
# # A tibble: 10 × 6
#    name                   label value            n pct_of_all pct_of_valid
#    <chr>                  <chr> <chr>        <int>      <dbl>        <dbl>
#  1 cat_na_vals_unlabelled NA    [1] Yes          1      0.333          0.5
#  2 cat_na_vals_unlabelled NA    [2] No           1      0.333          0.5
#  3 cat_na_vals_unlabelled NA    (Missing)        1      0.333         NA
#  4 cat_na_vals_labelled   NA    [1] Yes          1      0.333          0.5
#  5 cat_na_vals_labelled   NA    [2] No           1      0.333          0.5
#  6 cat_na_vals_labelled   NA    [-9] Refused     0      0              0
#  7 cat_na_vals_labelled   NA    (Missing)        1      0.333         NA
#  8 cat_na_range           NA    [1] Yes          1      0.333          0.5
#  9 cat_na_range           NA    [2] No           1      0.333          0.5
# 10 cat_na_range           NA    (Missing)        1      0.333         NA

In the text summary, the user missing value is NA instead of "-9", but otherwise handled correctly by is_missing, pct_of_valid, and pct_of_missing.

## with default `detail_missing = TRUE`:
## `value` is `NA` instead of `"-9"`, but otherwise OK
cb_summarize_text(cb)
# # A tibble: 3 × 9
#   name        label is_missing unique_n value     n pct_of_all pct_of_valid
#   <chr>       <chr> <lgl>         <int> <chr> <int>      <dbl>        <dbl>
# 1 txt_na_vals NA    FALSE             2 1         1      0.333          0.5
# 2 txt_na_vals NA    FALSE             2 2         1      0.333          0.5
# 3 txt_na_vals NA    TRUE             NA NA        1      0.333         NA  

## if `detail_missing = FALSE`, everything seems OK
cb_summarize_text(cb, detail_missing = FALSE)
# # A tibble: 3 × 7
#   name        label unique_n value         n pct_of_all pct_of_valid
#   <chr>       <chr>    <int> <chr>     <int>      <dbl>        <dbl>
# 1 txt_na_vals NA           2 1             1      0.333          0.5
# 2 txt_na_vals NA           2 2             1      0.333          0.5
# 3 txt_na_vals NA          NA (Missing)     1      0.333         NA

Everything seems OK for numeric summary.

## User missings are correctly accounted for in `valid_n` and `valid_pct`
## and excluded from stats.
cb_summarize_numeric(cb)
# # A tibble: 2 × 13
#   name  label valid_n valid_pct  mean    SD median   MAD   min   max range  skew
#   <chr> <chr>   <int>     <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 num_… NA          2     0.667   1.5 0.707    1.5 0.741     1     2     1     0
# 2 num_… NA          2     0.667   1.5 0.707    1.5 0.741     1     2     1     0
# # ℹ 1 more variable: kurt <dbl>
