krisoye/mdlr

mdlr

An R toolkit for streamlined model development, training, and forecasting at scale.

Description

mdlr provides a consistent, tidy-friendly workflow to:

  • load and prepare regression-ready datasets (including column selection and date filtering)
  • generate rolling/expanding training windows and forecast dates
  • fit a wide array of models with a single interface (base R, parsnip engines including glmnet, mgcv, rstanarm, and more)
  • iterate over model permutations (formulas, parameters, dates, and sub-models)
  • export fitted model artifacts and predictions to disk in tidy, partitioned layouts

Core exported functions (see @man/): fit_mdl, train_and_predict, gather_regression_source, generate_training_dates, generate_stat_grp, generate_model_permutations, iterate_train_and_predict, iterate_stat_grp, and export_mdl_stats. See @DESCRIPTION for dependencies and package metadata.

Features

  • Unified modeling interface: Use the same fit_mdl/train_and_predict surface across base R model functions and parsnip engines (e.g., stats::lm, glmnet, mgcv, rstanarm).
  • Dataset sourcing for regression: gather_regression_source automatically loads Hive-style partitioned data (e.g., via arrow::open_dataset) and re-attaches meta columns for modeling with precise column selection helpers.
  • Time-window generation: generate_training_dates creates rolling or expanding backtest windows with forward forecast dates, driven by first/last train dates and window sizes.
  • Statistical grouping: generate_stat_grp builds hierarchical statistical groups with configurable thresholds and naming, enabling robust segment-based modeling.
  • Batch experimentation: generate_model_permutations and iterate_train_and_predict help cross model parameters with backtest dates and iterate training/forecasting runs.
  • Tidy exports at scale: Write tidy, glance, and forecast outputs using pluggable writers (e.g., arrow::write_dataset, readr::write_rds) with optional partitioning by MODEL_ID, DATE, SUB_MODEL, etc.

Installation

mdlr is an R package. From the project root directory:

# Option 1: Install from source in this folder
install.packages("devtools")
devtools::install()

# Option 2: Develop locally without install
devtools::load_all()

Required imports are listed in @DESCRIPTION (e.g., dplyr, glmnet, lubridate, magrittr, assertthat, butcher, etlr). Some functionality uses additional packages if you opt into them:

  • For parquet/dataset IO and partitioned exports: arrow
  • For unified modeling engines: parsnip
  • For summaries: broom, broom.mixed
  • For writing RDS: readr
  • For examples in tests: fs, rstanarm, mgcv, glmnet

Install as needed, for example:

install.packages(c("arrow", "parsnip", "broom", "readr", "fs"))
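Because these packages are optional, a plain base R guard (an idiom, not an mdlr API) can fail fast with a clear message when one is missing:

```r
# Base R idiom: check an optional dependency at runtime before using it.
if (!requireNamespace("arrow", quietly = TRUE)) {
  stop("Package 'arrow' is needed for partitioned exports. ",
       "Install it with install.packages(\"arrow\").")
}
```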

Usage

Below are distilled examples inspired by @tests/ to help you get started quickly.

  • Fit a linear model (base R) via fit_mdl:
library(mdlr)
library(dplyr)

orig_frame <- tibble::tibble(
  ID3 = 1:60,
  N1 = sqrt(ID3),
  N2 = sqrt(N1),
  N3 = sqrt(N2) + rnorm(60)
)

fit <- fit_mdl(
  .data = orig_frame,
  .mdl_formula = formula("ID3 ~ N1 + N2 + N3"),
  .mdl_fxn = stats::lm
)

broom::tidy(fit)

  • Fit elastic net via parsnip + glmnet:
fit_en <- fit_mdl(
  .data = orig_frame,
  .mdl_formula = formula("ID3 ~ N1 + N2 + N3"),
  .mdl_fxn = parsnip::linear_reg,
  .mdl_parameters = list(penalty = double(1), mixture = double(1)),
  .engine_parameters = list(engine = "glmnet")
)

  • Train and export model artifacts:
base_dir <- etlr::create_temp_dir()
mdl_id <- "example_mdl"

mdl <- train_and_predict(
  .data_train = orig_frame,
  .mdl_formula = formula("ID3 ~ N3"),
  .mdl_fxn = parsnip::linear_reg,
  .mdl_id = mdl_id,
  .tidy_file_path = file.path(base_dir, "TIDY"),
  .glance_file_path = file.path(base_dir, "GLANCE"),
  .export_stat_fxn = arrow::write_dataset,
  .partitioning = c("MODEL_ID", "DATE", "SUB_MODEL"),
  .score_index_columns = c("DATE", "SYMBOL")
)

# Later, read exports
arrow::open_dataset(file.path(base_dir, "TIDY")) |> dplyr::collect()
arrow::open_dataset(file.path(base_dir, "GLANCE")) |> dplyr::collect()

  • Forecast export (train on one date, score on a future date):
train_data <- orig_frame |> dplyr::mutate(DATE = as.Date("2024-10-18"))
forecast_data <- orig_frame |> dplyr::mutate(DATE = as.Date("2024-10-25"))

forecast_dir <- file.path(base_dir, "FORECAST")

mdl <- train_and_predict(
  .data_train = train_data,
  .data_forecast = forecast_data,
  .mdl_formula = formula("ID3 ~ N3"),
  .mdl_fxn = parsnip::linear_reg,
  .mdl_id = mdl_id,
  .mdl_forecast_folder = forecast_dir,
  .score_index_columns = c("DATE", "SYMBOL"),
  .partitioning = c("MODEL_ID", "DATE")
)

arrow::open_dataset(forecast_dir) |> dplyr::collect()

  • Generate rolling/expanding training windows:
dates <- generate_training_dates(
  .first_train_date = lubridate::as_date("2005-01-01"),
  .last_train_date = lubridate::as_date("2005-03-30"),
  .training_window_weeks = 5,
  .increment_by = "1 week",
  .fwd_forecast_weeks = 1,
  .rolling = TRUE # set FALSE for expanding
)

  • Build hierarchical statistical groups:
stat_grp <- generate_stat_grp(
  .data = dplyr::tibble(RS_SUBIND = "SM1", RS_INDUSTRY = "MD1", RS_INDGRP = "LG1", RS_SECTOR = "XX1"),
  .stat_grouping_hierarchy = c("RS_SUBIND", "RS_INDUSTRY", "RS_INDGRP", "RS_SECTOR"),
  .stat_grouping_threshold = 25,
  .default_prefix = "STAT"
)

  • Source regression-ready data from Hive-style partitions:
src <- gather_regression_source(
  .datapath = "/path/to/dataset",
  .load_fxn = arrow::open_dataset,
  .context_vars = \() dplyr::any_of(c("DATE", "GROUP", "SYMBOL")),
  .response_vars = \() dplyr::any_of(c("y")),
  .covariate_vars = \() dplyr::matches("^x\\d$"),
  .min_train_date = lubridate::as_date("2024-01-01"),
  .forecast_date = lubridate::as_date("2024-12-01")
)
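  • Cross model parameters with backtest dates. Note this is a hypothetical sketch: the argument names and return shapes below are illustrative assumptions, not mdlr's documented signatures — check @man/ (e.g., ?generate_model_permutations) after installing:

```r
# HYPOTHETICAL sketch only: argument names are assumptions, not
# mdlr's documented interface.
perms <- generate_model_permutations(
  .mdl_formulas   = list(formula("ID3 ~ N3"), formula("ID3 ~ N1 + N3")),
  .mdl_parameters = list(list(penalty = 0), list(penalty = 0.1)),
  .training_dates = dates  # e.g., output of generate_training_dates() above
)

# One training/forecasting run per permutation.
results <- iterate_train_and_predict(perms, .data = orig_frame)
```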

For more examples, see the test files in @tests/.

Contributing

Contributions are welcome! Typical flow:

  • Open an issue describing the enhancement or bug.
  • Create a feature branch from main.
  • Add tests under tests/testthat/ demonstrating the change.
  • Ensure R CMD check passes locally.
  • Submit a pull request with a concise description and rationale.

Please keep code readable, prefer tidy principles, and ensure exports remain stable for downstream consumers. When adding features that rely on optional packages (e.g., arrow, parsnip), gate usage behind parameters and document clearly in @man/.
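As one concrete shape for that guidance (illustrative only — export_stats_sketch is a hypothetical helper, not an mdlr export), new features can mirror the pluggable-writer pattern of train_and_predict's .export_stat_fxn: accept the function as a parameter so the optional package is only touched when the caller supplies it.

```r
# Illustrative sketch: the optional package (e.g., arrow or readr) is
# referenced only through the caller-supplied .export_fxn parameter,
# so it is never loaded unless the caller opts in.
export_stats_sketch <- function(.data, .path, .export_fxn = saveRDS) {
  .export_fxn(.data, .path)
  invisible(.path)
}

# Callers opt in to arrow explicitly:
# export_stats_sketch(df, "out/", .export_fxn = arrow::write_dataset)
```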
