# etlr - A Variety of R ETL Features

etlr is a comprehensive R package that streamlines Extract, Transform, and Load (ETL) operations for data science and analytics workflows. It provides tools for data transformation, file management, cloud storage integration, and feature engineering that simplify common data processing tasks.

The package builds on popular R packages such as dplyr, lubridate, and magrittr to provide a cohesive interface for data manipulation. Whether you are working with local files or cloud storage (S3), or need to transform data for machine learning pipelines, etlr offers the functionality you need.
## Features

### Data Transformation

- **Quantile Functions**: Create quantile-based features with `tercile()`, `quintile()`, `fractile()`, and `quantile_column()`
- **Date Filtering**: Intelligent date filtering with `filter_date()`, supporting month-end and week-end filtering
- **Date Manipulation**: Advanced date operations including `month_offset()` and `time_shift_column()`
- **Feature Engineering**: Create ratios (`feature_ratio()`), calculate percentages of totals (`percent_of_total()`), and apply pseudo-logarithmic transformations (`pseudo_log()`, `pseudo_exp()`)
- **Column Manipulation**: Flexible column mutation with `mutate_columns()`, supporting type conversions and transformations
- **Value Replacement**: Batch-replace column values with `replace_column_values()`
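The feature list above mentions pseudo-logarithmic transforms. The widely used definition (the one behind `scales::pseudo_log_trans()`, for example) is `asinh(x / (2 * sigma))`, which behaves like `log()` for large values but stays defined at zero and for negatives. A base-R sketch under the assumption that etlr follows the same convention; the helper names here are illustrative, not etlr's API:

```r
# Assumed definitions: etlr's pseudo_log()/pseudo_exp() may differ in detail.
pseudo_log_sketch <- function(x, sigma = 1) asinh(x / (2 * sigma))
pseudo_exp_sketch <- function(y, sigma = 1) 2 * sigma * sinh(y)

x <- c(-10, -1, 0, 1, 10)            # defined at zero and negatives, unlike log()
y <- pseudo_log_sketch(x)
all.equal(pseudo_exp_sketch(y), x)   # TRUE: the two functions are inverses
```

The round trip is what makes the pair useful for feature engineering: you can transform skewed values, model them, and map predictions back to the original scale.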
### File Operations

- **Directory Serialization**: Consolidate multiple files into a single dataframe with `serialize_directory()` and `consolidate_files()`
- **File Conversion**: Convert files between formats (CSV, Feather, Parquet, etc.) with `convert_files_in_folder()`
- **File Loading/Storing**: Unified interface via `do_load()` and `do_store()`, supporting multiple backends (readr, arrow, vroom, readxl)
- **Join Files**: Horizontally join multiple files with `load_join_files()`
- **Hive Path Support**: Extract hive-style partition columns and open datasets with `extract_hive_columns()` and `open_from_hive_path()`
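The unified-interface idea behind `do_load()`/`do_store()` is that the I/O backend is passed in as a function, so readr, arrow, or vroom can be swapped without touching the calling code. A minimal base-R sketch of that pattern; the `demo_*` wrappers are hypothetical and only mimic the `.load_fxn`/`.store_fxn` convention visible elsewhere in this README, not etlr's actual implementation:

```r
# Hypothetical sketch of a do_load()/do_store()-style unified interface.
demo_load <- function(.file_path, .load_fxn = utils::read.csv, ...) {
  .load_fxn(.file_path, ...)
}
demo_store <- function(.object, .file_path, .store_fxn, ...) {
  .store_fxn(.object, .file_path, ...)
  invisible(.file_path)
}

tmp <- tempfile(fileext = ".csv")
demo_store(data.frame(id = 1:3, value = c(10, 20, 30)), tmp,
           .store_fxn = function(x, f) utils::write.csv(x, f, row.names = FALSE))
df <- demo_load(tmp)
nrow(df)  # 3
```

Because the backend is just a function argument, the same wrapper works for Feather or Parquet by passing `arrow::write_feather` or `arrow::write_parquet` instead.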
### Cloud Storage (S3)

- **S3 Operations**: Comprehensive S3 support with `cache_from_s3()`, `cache_to_s3()`, `cache_file_from_s3()`, and `cache_file_to_s3()`
- **S3 Directory Sync**: Synchronize directories with `s3_sync_directory()`
- **S3 File Operations**: Copy files and list directories with `s3_copy_file()` and `s3_list_directory()`
- **Safe Operations**: `safely_cache_to_s3()` provides error handling for cloud operations
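The `safely_*` naming suggests the common R pattern (as in `purrr::safely()`) of wrapping a fallible call so a failure returns an error record instead of aborting a long-running pipeline. A base-R sketch of that pattern; `safely_run()` and `risky_upload()` are hypothetical illustrations, not etlr's implementation:

```r
# Hypothetical sketch of the safely_*() error-handling pattern.
safely_run <- function(.fxn) {
  function(...) {
    tryCatch(
      list(result = .fxn(...), error = NULL),
      error = function(e) list(result = NULL, error = conditionMessage(e))
    )
  }
}

risky_upload <- function(.bucket) stop("no AWS credentials for bucket: ", .bucket)
safe_upload <- safely_run(risky_upload)

out <- safe_upload("my-data-bucket")
is.null(out$result)  # TRUE: the call failed...
out$error            # ...but the message was captured instead of raised
```

In an ETL job that uploads many objects, this lets you collect failures and retry them at the end rather than losing the whole run to one bad request.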
### Data Utilities

- **Column Management**: Add columns (`add_column()`, `add_additional_columns()`), append quarters (`append_quarter()`), and convert types (`integer_as_character()`)
- **Data Sampling**: Quick subsampling with `quick_filter()`
- **Helpers**: `when_null()` for NULL handling, `lsos()` for memory management, and `create_temp_dir()` for temporary directories
- **CLI Alerts**: Built-in notification system with `cli_alert()`
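`when_null()` reads like a NULL-coalescing helper, in the spirit of rlang's `%||%`. A base-R sketch of the assumed behavior; `demo_when_null()` is illustrative and may not match etlr's exact signature:

```r
# Hypothetical sketch of a when_null()-style helper: fall back to a default
# only when the value is NULL, otherwise keep the value as-is.
demo_when_null <- function(x, .default) if (is.null(x)) .default else x

config <- list(retries = NULL, timeout = 30)
demo_when_null(config$retries, 3)   # 3 (fell back to the default)
demo_when_null(config$timeout, 60)  # 30 (kept the existing value)
```

This kind of helper is handy for optional function arguments and config lists, where missing entries come back as `NULL`.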
## Installation

To install the development version of etlr from GitHub:
```r
# Install devtools if you haven't already
install.packages("devtools")

# Install etlr
devtools::install_github("your-username/etlr")
```

Alternatively, if you have the package source files:

```r
# Install from local source
devtools::install(".", dependencies = TRUE)
```

### Requirements

- R (>= 3.6.0)
- Required packages: `assertthat`, `dplyr`, `lubridate`, `magrittr`
- Optional but recommended: `arrow`, `readr`, `readxl`, `vroom`, `janitor`, `cli`
- For S3 functionality: the AWS CLI must be installed and configured
### Building from Source

```sh
# Clone the repository
git clone https://github.com/your-username/etlr.git
cd etlr

# Build and install
R CMD build .
R CMD INSTALL etlr_0.2.0.tar.gz
```

## Usage Examples

### Getting Started

```r
library(etlr)
library(dplyr)

# Create tercile bins
x <- runif(100)
terciles <- tercile(x, low_breakpoint = 0.3, high_breakpoint = 0.7)
table(terciles)

# Filter to month-end dates only
data <- tibble(timestamp_utc = seq(as.Date("2024-01-01"), as.Date("2024-12-31"), by = "day"))
month_end_data <- filter_date(data, .filter_date_column = "timestamp_utc", .filter_eom = TRUE)

# Create quantile features
data <- data %>%
  mutate(quantile_feature = quantile_column(ID3, .num_quantiles = 5))
```

### File Operations

```r
# Serialize multiple files into one dataframe
consolidated <- serialize_directory(
  .source_dir = "path/to/data",
  .target_file = "output.feather",
  .load_fxn = readr::read_delim,
  .store_fxn = arrow::write_feather
)

# Convert files from CSV to Feather format
convert_files_in_folder(
  .source_dir = "input_csv/",
  .target_dir = "output_feather/",
  .old_extension = ".csv",
  .new_extension = ".feather",
  .load_fxn = readr::read_csv,
  .store_fxn = arrow::write_feather
)

# Load files with type conversion
data <- serialize_directory(
  .source_dir = "input/",
  .char_cols = c("id", "category"),
  .date_cols = c("date", "created_at"),
  .integer_cols = c("count", "quantity")
)
```

### S3 Operations

```r
# Fetch data from S3
data <- cache_from_s3(
  .bucket = "my-data-bucket",
  .key = "path/to/data.parquet",
  .profile = "my-aws-profile",
  .load_fxn = arrow::read_parquet
)

# Upload data to S3
cache_file_to_s3(
  .object = processed_data,
  .bucket = "my-output-bucket",
  .key = "results/results.feather",
  .profile = "my-aws-profile",
  .store_fxn = arrow::write_feather
)

# List S3 directory contents
s3_files <- s3_list_directory(
  .bucket = "my-data-bucket",
  .key = "data/",
  .profile = "my-aws-profile",
  .recursive = TRUE
)
```

### Joining Files and Hive Paths

```r
# Join multiple files
joined_data <- load_join_files(
  .source_file_path = "data/",
  .join_by = c("id", "date"),
  .load_fxn = readr::read_csv,
  .join_fxn = dplyr::left_join
)

# Extract hive partition columns and load
data <- open_from_hive_path(
  .datapath = "s3://bucket/year=2024/month=01/day=15/",
  .load_fxn = arrow::open_dataset
)

# Apply batch column value replacements
replacement_tibble <- tibble(
  colname = c("status", "status"),
  pattern = c("A", "B"),
  replacement = c("Active", "Inactive")
)
data <- replace_column_values(data, .l = replacement_tibble)
```

### Feature Engineering

```r
# Create ratio features
data <- data %>%
  mutate(ratio = feature_ratio(numerator, denominator))

# Calculate percent of total
data <- data %>%
  mutate(pct = percent_of_total(value))

# Apply pseudo-log transform (handles negative values)
data <- data %>%
  mutate(log_feature = pseudo_log(value, .sigma = 1))

# Create fractile bins
data <- data %>%
  mutate(risk_level = fractile(value, .levels = c("low", "medium", "high")))
```

### Date Manipulation

```r
# Apply month offset
data <- data %>%
  mutate(future_date = month_offset(date, .direction = "future", .num_months = 1))

# Time shift columns
data <- data %>%
  time_shift_column(
    .group_name = id,
    .date_column = date,
    .shift_columns = c(value, count),
    .num_periods = -1,
    .duration_fxn = lubridate::dweeks
  )
```

## Contributing

We welcome contributions to etlr! Here's how you can help:
- Fork the repository on GitHub
- Clone your fork locally
- Create a new branch for your feature (`git checkout -b feature/amazing-feature`)
- Make your changes and ensure tests pass
- Commit your changes with clear messages
- Push to your fork (`git push origin feature/amazing-feature`)
- Open a Pull Request
### Development Guidelines

- Follow R style guidelines and use `roxygen2` for documentation
- Add tests for new functionality in `tests/testthat/`
- Update documentation in the `man/` directory
- Keep functions focused and modular
- Use meaningful function and variable names
- Handle edge cases and provide informative error messages
### Running Tests

```r
# Run all tests
devtools::test()

# Run specific test file
testthat::test_file("tests/testthat/test-tercile.R")

# Check package
devtools::check()
```

### Code Style

- Use the magrittr pipe (`%>%`) for readability
- Prefix internal/helper functions with `.`
- Include comprehensive documentation with examples
- Follow existing patterns in the codebase
## Reporting Issues

If you encounter bugs or have suggestions:

- Check existing issues to avoid duplicates
- Create a new issue with:
  - A clear title and description
  - A reproducible example
  - Expected vs. actual behavior
  - Session info (`sessionInfo()`)
## License

This package is licensed under MIT + file LICENSE. See the LICENSE file for details.
## Author

- Krisoye Smith - Author and Maintainer
- Email: krisoye.work@gmail.com
## Acknowledgments

- Built with `dplyr`, `lubridate`, `arrow`, and other excellent R packages
- Inspired by the need for streamlined ETL workflows in data science
Current version: **0.2.0**

For more detailed documentation on specific functions, see:

- Package documentation: `help(package = "etlr")`
- Individual function help: `?function_name`
- GitHub repository: [Link to repository]
- Issue tracker: [Link to issues]
Happy Data Wrangling! 🚀