Merged
58 changes: 51 additions & 7 deletions README.md
@@ -26,9 +26,9 @@ For performance details, see `performance_validity.R` in the `extdata` folder.
- [Introduction and Advanced Usage](https://polkas.github.io/miceFast/articles/miceFast-intro.html)
- [Missing Data Mechanisms and Multiple Imputation](https://polkas.github.io/miceFast/articles/missing-data-and-imputation.html)

## Multiple Imputation and mice
## Multiple Imputation Workflow

[mice](https://cran.r-project.org/package=mice) implements the full MI pipeline (impute, analyze, pool). **miceFast** focuses on the computationally expensive part: fitting the imputation models. Two usage modes:
[mice](https://cran.r-project.org/package=mice) implements the full MI pipeline (impute, analyze, pool). **miceFast** focuses on the computationally expensive part, fitting the imputation models, and is typically **~10× faster** than mice for the imputation step alone (see [benchmarks](#performance-highlights)). Two usage modes:

1. **MI with Rubin's rules** — call `fill_NA()` with a stochastic model (`lm_bayes`, `lm_noise`, or `lda` with a random `ridge`) in a loop to create *m* completed datasets, then `pool()` the fitted models.

@@ -50,6 +50,8 @@ devtools::install_github("polkas/miceFast")

## Quick Example

### dplyr

```r
library(miceFast)
library(dplyr)
@@ -60,10 +62,7 @@ data(air_miss)
# Visualize the NA structure
upset_NA(air_miss, 6)

# Naive imputation (quick, but biased — see ?naive_fill_NA)
naive_fill_NA(air_miss)

# Model-based single imputation with fill_NA
# Model-based single imputation
air_miss %>%
mutate(Ozone_imp = fill_NA(
x = ., model = "lm_bayes",
@@ -80,9 +79,46 @@ completed <- lapply(1:5, function(i) {
})
fits <- lapply(completed, function(d) lm(Ozone_imp ~ Wind + Temp, data = d))
pool(fits)
#> Pooled results from 5 imputed datasets
#> Rubin's rules with Barnard-Rubin df adjustment
#>
#> term estimate std.error statistic df p.value
#> (Intercept) -62.771 23.9022 -2.626 46.95 1.162e-02
#> Wind -3.087 0.6857 -4.502 37.24 6.420e-05
#> Temp 1.736 0.2498 6.951 58.54 3.400e-09
```
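
For readers who want to see what the pooling step combines, here is a minimal base-R sketch of Rubin's rules on the built-in `airquality` data. It is not how `pool()` is implemented. `impute_once()` is a hypothetical helper that adds residual noise to regression predictions, loosely in the spirit of a stochastic model such as `lm_noise`. It does not draw the regression parameters, and it omits the degrees-of-freedom adjustment.

```r
# Hand-rolled Rubin's rules on base R's airquality data (illustrative only).
set.seed(1234)
m <- 5

# Hypothetical helper: stochastic regression imputation of Ozone
# (prediction plus residual noise). Wind and Temp have no missing values.
impute_once <- function(d) {
  fit <- lm(Ozone ~ Wind + Temp, data = d)   # lm() drops rows with missing Ozone
  miss <- is.na(d$Ozone)
  d$Ozone[miss] <- predict(fit, newdata = d[miss, ]) +
    rnorm(sum(miss), sd = sigma(fit))
  d
}

# Fit the analysis model on each of the m completed datasets
fits <- lapply(seq_len(m), function(i) {
  lm(Ozone ~ Wind + Temp, data = impute_once(airquality))
})

est        <- sapply(fits, coef)                       # p x m matrix of estimates
within_var <- sapply(fits, function(f) diag(vcov(f)))  # p x m matrix of variances

q_bar <- rowMeans(est)            # pooled point estimate
u_bar <- rowMeans(within_var)     # within-imputation variance
b     <- apply(est, 1, var)       # between-imputation variance
t_var <- u_bar + (1 + 1 / m) * b  # total variance

pooled <- data.frame(estimate = q_bar, std.error = sqrt(t_var))
pooled
```

The total variance is never smaller than the average within-imputation variance. The `(1 + 1 / m) * b` term is the extra uncertainty due to the missing data.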

### data.table

```r
library(miceFast)
library(data.table)

set.seed(1234)
data(air_miss)
setDT(air_miss)

# Single imputation
air_miss[, Ozone_imp := fill_NA(
x = .SD, model = "lm_bayes",
posit_y = "Ozone", posit_x = c("Solar.R", "Wind", "Temp")
)]

# Grouped imputation — fits a separate model per group
air_miss[, Solar_R_imp := fill_NA(
x = .SD, model = "lm_bayes",
posit_y = "Solar.R", posit_x = c("Wind", "Temp", "Intercept")
), by = .(groups)]
```

### Naive imputation (baseline only)

```r
# Quick baseline — biased, does not account for relationships between variables
naive_fill_NA(air_miss)
```

See the [Introduction vignette](https://polkas.github.io/miceFast/articles/miceFast-intro.html) for grouped imputation, data.table syntax, the OOP interface, and more.
See the [Introduction vignette](https://polkas.github.io/miceFast/articles/miceFast-intro.html) for weights, the OOP interface, log-transformations, and more.

---

@@ -113,6 +149,14 @@ See the [Introduction vignette](https://polkas.github.io/miceFast/articles/miceF

---

## Practical Advice

- **Little missing data + MCAR?** Consider using `complete.cases()` — listwise deletion is unbiased under MCAR and may be sufficient when the fraction of incomplete rows is small.
- **For publication**, always run a **sensitivity analysis**: compare MI results against baseline methods (`complete.cases()`, mean imputation) and across different imputation models (`lm_bayes`, `lm_noise`, `pmm`). Vary the number of imputations. If conclusions change, investigate why. Report the imputation model, *m*, and any assumptions about the missing-data mechanism.
- See the [MI vignette](https://polkas.github.io/miceFast/articles/missing-data-and-imputation.html) for details on MCAR/MAR/MNAR mechanisms and a practical checklist.
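
The two baselines mentioned above can be sketched in base R. This example uses the built-in `airquality` data rather than `air_miss`, and the analysis model `Ozone ~ Wind + Temp` is only an assumption for illustration. MI results would be added as a further column of the comparison.

```r
# Fraction of rows with at least one missing value
incomplete <- !complete.cases(airquality)
mean(incomplete)

# Baseline 1: listwise deletion (complete cases only)
cc_data <- airquality[!incomplete, ]
fit_cc  <- lm(Ozone ~ Wind + Temp, data = cc_data)

# Baseline 2: mean imputation of Ozone (biased; shown for comparison only)
mean_imp <- airquality
mean_imp$Ozone[is.na(mean_imp$Ozone)] <- mean(airquality$Ozone, na.rm = TRUE)
fit_mean <- lm(Ozone ~ Wind + Temp, data = mean_imp)

# Side-by-side coefficients: large gaps are a signal to investigate
comparison <- cbind(complete_cases = coef(fit_cc), mean_imputation = coef(fit_mean))
round(comparison, 3)
```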

---

## Performance Highlights

Median timings on 100k rows, 10 variables, 100 groups (R 4.4.3, macOS M3 Pro, [optimized BLAS/LAPACK](https://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html#Which-BLAS-is-used-and-how-can-it-be-changed_003f)):
94 changes: 40 additions & 54 deletions vignettes/miceFast-intro.Rmd
@@ -72,7 +72,9 @@ upset_NA(air_miss, 6)

## Checking for Collinearity

Before imputing, check Variance Inflation Factors. Values above ~10 suggest problematic collinearity.
Before imputing, check Variance Inflation Factors. Values above ~10 suggest
problematic collinearity that can destabilize imputation models — consider
dropping or combining redundant predictors before imputing.
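
As background, the textbook VIF for predictor $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on the others. The base-R sketch below computes it by hand on `airquality`. It is an illustration of the definition only and is not claimed to match the internals of the package's `VIF()` function.

```r
# Textbook VIF computed by hand (illustrative; not miceFast's VIF()).
X <- na.omit(airquality[, c("Solar.R", "Wind", "Temp")])

vif_by_hand <- sapply(names(X), function(v) {
  # Regress predictor v on all remaining predictors
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})

round(vif_by_hand, 2)
```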

```{r vif}
VIF(
@@ -140,7 +142,8 @@ head(result_grouped[, c("Solar.R", "Solar_R_imp", "groups")])

## Log-transformation

For right-skewed variables (like Ozone), use `logreg = TRUE` to impute on the log scale:
For right-skewed variables (like Ozone), use `logreg = TRUE` to impute on the log scale.
The model fits on $\log(y)$ and back-transforms the predictions:

```{r fill-na-logreg}
data(air_miss)
@@ -153,11 +156,15 @@ result_log <- air_miss %>%
posit_x = c("Solar.R", "Wind", "Temp", "Intercept"),
logreg = TRUE
))

# Compare distributions: log imputation avoids negative values
summary(result_log[c("Ozone", "Ozone_imp")])
```
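
The idea behind imputing on the log scale can be sketched in base R as follows. This is a deterministic illustration on `airquality` under the assumption that all observed values are strictly positive. It is not the package's implementation of `logreg = TRUE`, which also supports stochastic models.

```r
# Log-scale regression imputation by hand (illustrative only).
d    <- airquality
miss <- is.na(d$Ozone)

# Fit on log(Ozone); requires strictly positive observed values
fit_log <- lm(log(Ozone) ~ Wind + Temp, data = d)

# Predict on the log scale, then back-transform with exp()
d$Ozone_imp <- d$Ozone
d$Ozone_imp[miss] <- exp(predict(fit_log, newdata = d[miss, ]))

# Back-transformed predictions are positive by construction
summary(d[c("Ozone", "Ozone_imp")])
```

Note that a plain `exp()` back-transform targets the conditional median rather than the mean on the original scale.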

## Using column position indices

You can refer to columns by position instead of name:
You can refer to columns by position instead of name.
Check `names(air_miss)` to find the right positions:

```{r fill-na-position}
data(air_miss)
@@ -170,7 +177,8 @@ result_pos <- air_miss %>%
posit_x = c(4, 6),
logreg = TRUE
))

head(result_pos[, c("Ozone", "Ozone_imp")])
```

## Basic usage (data.table)

@@ -309,37 +317,6 @@ pool_res
summary(pool_res)
```

## MI with continuous and categorical variables

For a complete MI workflow imputing both continuous and categorical variables:

```{r mi-mixed}
data(air_miss)

impute_data <- function(data) {
data %>%
mutate(
Solar_R_imp = fill_NA(
x = ., model = "lm_bayes",
posit_y = "Solar.R",
posit_x = c("Wind", "Temp", "Intercept"),
w = weights
),
Ozone_chac_imp = fill_NA(
x = ., model = "lda",
posit_y = "Ozone_chac",
posit_x = c("Wind", "Temp"),
ridge = runif(1, 0, 50) # random ridge makes LDA stochastic
)
)
}

set.seed(42)
res <- replicate(n = 5, expr = impute_data(air_miss), simplify = FALSE)
fits <- lapply(res, function(d) lm(Solar_R_imp ~ Wind + Temp, data = d))
pool(fits)
```

---

# Full Imputation: Filling All Variables and MI with Rubin's Rules
@@ -469,21 +446,19 @@ use `get_index()` to recover the original row order.

## Simple example

```{r oop-simple, eval=requireNamespace("mice", quietly=TRUE)}
data <- cbind(as.matrix(mice::nhanes), intercept = 1, index = 1:nrow(mice::nhanes))
```{r oop-simple}
data <- cbind(as.matrix(airquality[, 1:4]), intercept = 1, index = 1:nrow(airquality))
model <- new(miceFast)
model$set_data(data)

# Single imputation with linear model
model$update_var(2, model$impute("lm_pred", 2, 5)$imputations)

# LDA for a categorical variable
model$update_var(3, model$impute("lda", 3, c(1, 2))$imputations)
# Single imputation with linear model (col 1 = Ozone)
model$update_var(1, model$impute("lm_pred", 1, 5)$imputations)

# Averaged multiple imputation (Bayesian, k=10 draws)
model$update_var(4, model$impute_N("lm_bayes", 4, c(1, 2, 3), k = 10)$imputations)
# Averaged multiple imputation for Solar.R (col 2, Bayesian, k=10 draws)
model$update_var(2, model$impute_N("lm_bayes", 2, c(1, 3, 4, 5), k = 10)$imputations)

model$which_updated()
head(model$get_data(), 4)
```

## With weights and groups
@@ -565,17 +540,28 @@ summary(pool(fits))

# Generating Correlated Data with `corrData`

The `corrData` module generates correlated data for simulations:

```r
# Constructors:
new(corrData, nr_cat, n_obs, means, cor_matrix)
new(corrData, n_obs, means, cor_matrix)
The `corrData` module generates correlated data for simulations.
This is useful for creating test datasets with known properties.

```{r corrdata-example}
# 3 continuous variables, 100 observations
means <- c(10, 20, 30)
cor_matrix <- matrix(c(
1.0, 0.7, 0.3,
0.7, 1.0, 0.5,
0.3, 0.5, 1.0
), nrow = 3)

cd <- new(corrData, 100, means, cor_matrix)
sim_data <- cd$fill("contin")
round(cor(sim_data), 2)
```

# Methods:
cd_obj$fill("contin") # continuous data
cd_obj$fill("binom") # binary data
cd_obj$fill("discrete") # multi-category discrete data
```{r corrdata-discrete}
# With 2 categorical variables: first argument is nr_cat
cd2 <- new(corrData, 2, 200, means, cor_matrix)
sim_discrete <- cd2$fill("discrete")
head(sim_discrete)
```
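
For intuition about how continuous data with a target correlation structure can be generated, here is a base-R sketch using a Cholesky factor. It assumes unit variances and normally distributed variables, and it is not claimed to be how `corrData` works internally.

```r
# Correlated normal data via Cholesky factorization (illustrative only).
set.seed(1)
n <- 5000
means <- c(10, 20, 30)
cor_matrix <- matrix(c(
  1.0, 0.7, 0.3,
  0.7, 1.0, 0.5,
  0.3, 0.5, 1.0
), nrow = 3)

z <- matrix(rnorm(n * 3), nrow = n)  # independent standard normals
x <- z %*% chol(cor_matrix)          # chol() returns R with t(R) %*% R = cor_matrix
x <- sweep(x, 2, means, "+")         # shift each column to its target mean

round(cor(x), 2)
round(colMeans(x), 2)
```

With enough observations, the sample correlations and means should land close to `cor_matrix` and `means`.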

---