An R toolkit for streamlined model development, training, and forecasting at scale.
mdlr provides a consistent, tidy-friendly workflow to:
- load and prepare regression-ready datasets (including column selection and date filtering)
- generate rolling/expanding training windows and forecast dates
- fit a wide array of models with a single interface (base R, parsnip engines including glmnet, mgcv, rstanarm, and more)
- iterate over model permutations (formulas, parameters, dates, and sub-models)
- export fitted model artifacts and predictions to disk in tidy, partitioned layouts
Core exported functions (see @man/): `fit_mdl`, `train_and_predict`, `gather_regression_source`, `generate_training_dates`, `generate_stat_grp`, `generate_model_permutations`, `iterate_train_and_predict`, `iterate_stat_grp`, and `export_mdl_stats`. See @DESCRIPTION for dependencies and package metadata.
- Unified modeling interface: use the same `fit_mdl`/`train_and_predict` surface across base R models and `parsnip` models and engines (e.g., `lm`, `glmnet`, `mgcv`, `rstanarm`).
- Dataset sourcing for regression: `gather_regression_source` automatically loads Hive-style partitioned data (e.g., via `arrow::open_dataset`) and re-attaches meta columns for modeling, with precise column selection helpers.
- Time-window generation: `generate_training_dates` creates rolling or expanding backtest windows with forward forecast dates, driven by first/last train dates and window sizes.
- Statistical grouping: `generate_stat_grp` builds hierarchical statistical groups with configurable thresholds and naming, enabling robust segment-based modeling.
- Batch experimentation: `generate_model_permutations` and `iterate_train_and_predict` help cross model parameters with backtest dates and iterate training/forecasting runs.
- Tidy exports at scale: write `tidy`, `glance`, and `forecast` outputs using pluggable writers (e.g., `arrow::write_dataset`, `readr::write_rds`), with optional partitioning by `MODEL_ID`, `DATE`, `SUB_MODEL`, etc.
mdlr is an R package. From the project root directory:

```r
# Option 1: install from source in this folder
install.packages("devtools")
devtools::install(local = TRUE)

# Option 2: develop locally without installing
devtools::load_all()
```

Required imports are listed in @DESCRIPTION (e.g., dplyr, glmnet, lubridate, magrittr, assertthat, butcher, etlr). Some functionality uses additional packages if you opt into them:

- For parquet/dataset IO and partitioned exports: `arrow`
- For unified modeling engines: `parsnip`
- For summaries: `broom`, `broom.mixed`
- For writing RDS: `readr`
- For examples in tests: `fs`, `rstanarm`, `mgcv`, `glmnet`

Install as needed, for example:

```r
install.packages(c("arrow", "parsnip", "broom", "readr", "fs"))
```

Below are distilled examples inspired by @tests/ to help you get started quickly.
- Fit a linear model (base R) via `fit_mdl`:

```r
library(mdlr)
library(dplyr)

orig_frame <- tibble::tibble(
  ID3 = 1:60,
  N1 = sqrt(ID3),
  N2 = sqrt(N1),
  N3 = sqrt(N2) + rnorm(60)
)

fit <- fit_mdl(
  .data = orig_frame,
  .mdl_formula = formula("ID3 ~ N1 + N2 + N3"),
  .mdl_fxn = stats::lm
)

broom::tidy(fit)
```

- Fit elastic net via parsnip + glmnet:
```r
fit_en <- fit_mdl(
  .data = orig_frame,
  .mdl_formula = formula("ID3 ~ N1 + N2 + N3"),
  .mdl_fxn = parsnip::linear_reg,
  .mdl_parameters = list(penalty = double(1), mixture = double(1)),
  .engine_parameters = list(engine = "glmnet")
)
```

- Train and export model artifacts:
```r
base_dir <- etlr::create_temp_dir()
mdl_id <- "example_mdl"

mdl <- train_and_predict(
  .data_train = orig_frame,
  .mdl_formula = formula("ID3 ~ N3"),
  .mdl_fxn = parsnip::linear_reg,
  .mdl_id = mdl_id,
  .tidy_file_path = file.path(base_dir, "TIDY"),
  .glance_file_path = file.path(base_dir, "GLANCE"),
  .export_stat_fxn = arrow::write_dataset,
  .partitioning = c("MODEL_ID", "DATE", "SUB_MODEL"),
  .score_index_columns = c("DATE", "SYMBOL")
)

# Later, read the exports back
arrow::open_dataset(file.path(base_dir, "TIDY")) |> dplyr::collect()
arrow::open_dataset(file.path(base_dir, "GLANCE")) |> dplyr::collect()
```

- Forecast export (train on a subset, score on a future date):
```r
train_data <- orig_frame |> dplyr::mutate(DATE = as.Date("2024-10-18"))
forecast_data <- orig_frame |> dplyr::mutate(DATE = as.Date("2024-10-25"))
forecast_dir <- file.path(base_dir, "FORECAST")

mdl <- train_and_predict(
  .data_train = train_data,
  .data_forecast = forecast_data,
  .mdl_formula = formula("ID3 ~ N3"),
  .mdl_fxn = parsnip::linear_reg,
  .mdl_id = mdl_id,
  .mdl_forecast_folder = forecast_dir,
  .score_index_columns = c("DATE", "SYMBOL"),
  .partitioning = c("MODEL_ID", "DATE")
)

arrow::open_dataset(forecast_dir) |> dplyr::collect()
```

- Generate rolling/expanding training windows:
```r
dates <- generate_training_dates(
  .first_train_date = lubridate::as_date("2005-01-01"),
  .last_train_date = lubridate::as_date("2005-03-30"),
  .training_window_weeks = 5,
  .increment_by = "1 week",
  .fwd_forecast_weeks = 1,
  .rolling = TRUE  # set FALSE for expanding windows
)
```

- Build hierarchical statistical groups:
```r
stat_grp <- generate_stat_grp(
  .data = dplyr::tibble(
    RS_SUBIND = "SM1",
    RS_INDUSTRY = "MD1",
    RS_INDGRP = "LG1",
    RS_SECTOR = "XX1"
  ),
  .stat_grouping_hierarchy = c("RS_SUBIND", "RS_INDUSTRY", "RS_INDGRP", "RS_SECTOR"),
  .stat_grouping_threshold = 25,
  .default_prefix = "STAT"
)
```

- Source regression-ready data from Hive-style partitions:
```r
src <- gather_regression_source(
  .datapath = "/path/to/dataset",
  .load_fxn = arrow::open_dataset,
  .context_vars = \() dplyr::any_of(c("DATE", "GROUP", "SYMBOL")),
  .response_vars = \() dplyr::any_of(c("y")),
  .covariate_vars = \() dplyr::matches("^x\\d$"),
  .min_train_date = lubridate::as_date("2024-01-01"),
  .forecast_date = lubridate::as_date("2024-12-01")
)
```

For more examples, see the test files in @tests/.
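The batch-experimentation helpers (`generate_model_permutations` and `iterate_train_and_predict`) are described above but not covered by the distilled examples. The sketch below only illustrates the intended shape of that workflow; the argument names here are hypothetical, so consult the function documentation and @tests/ for the actual signatures.

```r
# Hypothetical sketch -- argument names are illustrative, not the real API.
# Cross candidate formulas/parameters with backtest windows, then run each
# permutation through training and forecasting.
permutations <- generate_model_permutations(
  .mdl_formulas = list(
    formula("ID3 ~ N3"),
    formula("ID3 ~ N1 + N2 + N3")
  ),
  .mdl_parameters = list(list(penalty = 0, mixture = 0)),
  .training_dates = dates  # e.g., from generate_training_dates()
)

results <- iterate_train_and_predict(
  .data = orig_frame,
  .permutations = permutations,
  .mdl_fxn = parsnip::linear_reg
)
```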
Contributions are welcome! Typical flow:
- Open an issue describing the enhancement or bug.
- Create a feature branch from `main`.
- Add tests under `tests/testthat/` demonstrating the change.
- Ensure `R CMD check` passes locally.
- Submit a pull request with a concise description and rationale.
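One convenient way to run the check step locally, assuming devtools is installed:

```r
# Runs R CMD check on the package in the current directory,
# rebuilding documentation first.
devtools::check()
```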
Please keep code readable, prefer tidy principles, and ensure exports remain stable for downstream consumers. When adding features that rely on optional packages (e.g., arrow, parsnip), gate usage behind parameters and document clearly in @man/.