# etlr - A Variety of R ETL Features

etlr is a comprehensive R package that streamlines Extract, Transform, and Load (ETL) operations for data science and analytics workflows. It provides tools for data transformation, file management, cloud storage integration, and feature engineering that simplify common data processing tasks.

The package builds on popular R packages such as dplyr, lubridate, and magrittr to provide a cohesive interface for data manipulation. Whether you are working with local files or cloud storage (S3), or need to transform data for machine learning pipelines, etlr offers the functionality you need.
## Features

### Data Transformation

- **Quantile Functions**: Create quantile-based features with `tercile()`, `quintile()`, `fractile()`, and `quantile_column()`
- **Date Filtering**: Intelligent date filtering with `filter_date()`, supporting month-end and week-end filtering
- **Date Manipulation**: Advanced date operations including `month_offset()` and `time_shift_column()`
- **Feature Engineering**: Create ratios (`feature_ratio()`), calculate percentages of totals (`percent_of_total()`), and apply pseudo-logarithmic transformations (`pseudo_log()`, `pseudo_exp()`)
- **Column Manipulation**: Flexible column mutation with `mutate_columns()`, supporting type conversions and transformations
- **Value Replacement**: Batch-replace column values with `replace_column_values()`
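The feature list above mentions pseudo-logarithmic transforms. The widely used definition (the one behind `scales::pseudo_log_trans()`, for example) is `asinh(x / (2 * sigma))`, which behaves like `log()` for large values but stays defined at zero and for negatives. A base-R sketch under the assumption that etlr follows the same convention; the helper names here are illustrative, not etlr's API:

```r
# Assumed definitions: etlr's pseudo_log()/pseudo_exp() may differ in detail.
pseudo_log_sketch <- function(x, sigma = 1) asinh(x / (2 * sigma))
pseudo_exp_sketch <- function(y, sigma = 1) 2 * sigma * sinh(y)

x <- c(-10, -1, 0, 1, 10)            # defined at zero and negatives, unlike log()
y <- pseudo_log_sketch(x)
all.equal(pseudo_exp_sketch(y), x)   # TRUE: the two functions are inverses
```

The round trip is what makes the pair useful for feature engineering: you can transform skewed values, model them, and map predictions back to the original scale.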
### File Operations

- **Directory Serialization**: Consolidate multiple files into a single dataframe with `serialize_directory()` and `consolidate_files()`
- **File Conversion**: Convert files between formats (CSV, Feather, Parquet, etc.) with `convert_files_in_folder()`
- **File Loading/Storing**: Unified interface via `do_load()` and `do_store()`, supporting multiple backends (readr, arrow, vroom, readxl)
- **Join Files**: Horizontally join multiple files with `load_join_files()`
- **Hive Path Support**: Extract hive-style partition columns and open datasets with `extract_hive_columns()` and `open_from_hive_path()`
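The unified-interface idea behind `do_load()`/`do_store()` is that the I/O backend is passed in as a function, so readr, arrow, or vroom can be swapped without touching the calling code. A minimal base-R sketch of that pattern; the `demo_*` wrappers are hypothetical and only mimic the `.load_fxn`/`.store_fxn` convention visible elsewhere in this README, not etlr's actual implementation:

```r
# Hypothetical sketch of a do_load()/do_store()-style unified interface.
demo_load <- function(.file_path, .load_fxn = utils::read.csv, ...) {
  .load_fxn(.file_path, ...)
}
demo_store <- function(.object, .file_path, .store_fxn, ...) {
  .store_fxn(.object, .file_path, ...)
  invisible(.file_path)
}

tmp <- tempfile(fileext = ".csv")
demo_store(data.frame(id = 1:3, value = c(10, 20, 30)), tmp,
           .store_fxn = function(x, f) utils::write.csv(x, f, row.names = FALSE))
df <- demo_load(tmp)
nrow(df)  # 3
```

Because the backend is just a function argument, the same wrapper works for Feather or Parquet by passing `arrow::write_feather` or `arrow::write_parquet` instead.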
### Cloud Storage (S3)

- **S3 Operations**: Comprehensive S3 support with `cache_from_s3()`, `cache_to_s3()`, `cache_file_from_s3()`, and `cache_file_to_s3()`
- **S3 Directory Sync**: Synchronize directories with `s3_sync_directory()`
- **S3 File Operations**: Copy files and list directories with `s3_copy_file()` and `s3_list_directory()`
- **Safe Operations**: `safely_cache_to_s3()` provides error handling for cloud operations
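The `safely_*` naming suggests the common R pattern (as in `purrr::safely()`) of wrapping a fallible call so a failure returns an error record instead of aborting a long-running pipeline. A base-R sketch of that pattern; `safely_run()` and `risky_upload()` are hypothetical illustrations, not etlr's implementation:

```r
# Hypothetical sketch of the safely_*() error-handling pattern.
safely_run <- function(.fxn) {
  function(...) {
    tryCatch(
      list(result = .fxn(...), error = NULL),
      error = function(e) list(result = NULL, error = conditionMessage(e))
    )
  }
}

risky_upload <- function(.bucket) stop("no AWS credentials for bucket: ", .bucket)
safe_upload <- safely_run(risky_upload)

out <- safe_upload("my-data-bucket")
is.null(out$result)  # TRUE: the call failed...
out$error            # ...but the message was captured instead of raised
```

In an ETL job that uploads many objects, this lets you collect failures and retry them at the end rather than losing the whole run to one bad request.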
### Data Utilities

- **Column Management**: Add columns (`add_column()`, `add_additional_columns()`), append quarters (`append_quarter()`), and convert types (`integer_as_character()`)
- **Data Sampling**: Quick subsampling with `quick_filter()`
- **Helpers**: `when_null()` for NULL handling, `lsos()` for memory management, and `create_temp_dir()` for temporary directories
- **CLI Alerts**: Built-in notification system with `cli_alert()`
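`when_null()` reads like a NULL-coalescing helper, in the spirit of rlang's `%||%`. A base-R sketch of the assumed behavior; `demo_when_null()` is illustrative and may not match etlr's exact signature:

```r
# Hypothetical sketch of a when_null()-style helper: fall back to a default
# only when the value is NULL, otherwise keep the value as-is.
demo_when_null <- function(x, .default) if (is.null(x)) .default else x

config <- list(retries = NULL, timeout = 30)
demo_when_null(config$retries, 3)   # 3 (fell back to the default)
demo_when_null(config$timeout, 60)  # 30 (kept the existing value)
```

This kind of helper is handy for optional function arguments and config lists, where missing entries come back as `NULL`.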
## Installation

To install the development version of etlr from GitHub:
```r
# Install devtools if you haven't already
install.packages("devtools")

# Install etlr
devtools::install_github("your-username/etlr")
```

Alternatively, if you have the package source files:

```r
# Install from local source
devtools::install(".", dependencies = TRUE)
```

### Requirements

- R (>= 3.6.0)
- Required packages: `assertthat`, `dplyr`, `lubridate`, `magrittr`
- Optional but recommended: `arrow`, `readr`, `readxl`, `vroom`, `janitor`, `cli`
- For S3 functionality: the AWS CLI must be installed and configured
### Building from Source

```sh
# Clone the repository
git clone https://github.com/your-username/etlr.git
cd etlr

# Build and install
R CMD build .
R CMD INSTALL etlr_0.2.0.tar.gz
```

## Usage Examples

### Getting Started

```r
library(etlr)
library(dplyr)

# Create tercile bins
x <- runif(100)
terciles <- tercile(x, low_breakpoint = 0.3, high_breakpoint = 0.7)
table(terciles)

# Filter to month-end dates only
data <- tibble(timestamp_utc = seq(as.Date("2024-01-01"), as.Date("2024-12-31"), by = "day"))
month_end_data <- filter_date(data, .filter_date_column = "timestamp_utc", .filter_eom = TRUE)

# Create quantile features
data <- data %>%
  mutate(quantile_feature = quantile_column(ID3, .num_quantiles = 5))
```

### File Operations

```r
# Serialize multiple files into one dataframe
consolidated <- serialize_directory(
  .source_dir = "path/to/data",
  .target_file = "output.feather",
  .load_fxn = readr::read_delim,
  .store_fxn = arrow::write_feather
)

# Convert files from CSV to Feather format
convert_files_in_folder(
  .source_dir = "input_csv/",
  .target_dir = "output_feather/",
  .old_extension = ".csv",
  .new_extension = ".feather",
  .load_fxn = readr::read_csv,
  .store_fxn = arrow::write_feather
)

# Load files with type conversion
data <- serialize_directory(
  .source_dir = "input/",
  .char_cols = c("id", "category"),
  .date_cols = c("date", "created_at"),
  .integer_cols = c("count", "quantity")
)
```

### S3 Operations

```r
# Fetch data from S3
data <- cache_from_s3(
  .bucket = "my-data-bucket",
  .key = "path/to/data.parquet",
  .profile = "my-aws-profile",
  .load_fxn = arrow::read_parquet
)

# Upload data to S3
cache_file_to_s3(
  .object = processed_data,
  .bucket = "my-output-bucket",
  .key = "results/results.feather",
  .profile = "my-aws-profile",
  .store_fxn = arrow::write_feather
)

# List S3 directory contents
s3_files <- s3_list_directory(
  .bucket = "my-data-bucket",
  .key = "data/",
  .profile = "my-aws-profile",
  .recursive = TRUE
)
```

### Joining Files and Hive Paths

```r
# Join multiple files
joined_data <- load_join_files(
  .source_file_path = "data/",
  .join_by = c("id", "date"),
  .load_fxn = readr::read_csv,
  .join_fxn = dplyr::left_join
)

# Extract hive partition columns and load
data <- open_from_hive_path(
  .datapath = "s3://bucket/year=2024/month=01/day=15/",
  .load_fxn = arrow::open_dataset
)

# Apply batch column value replacements
replacement_tibble <- tibble(
  colname = c("status", "status"),
  pattern = c("A", "B"),
  replacement = c("Active", "Inactive")
)
data <- replace_column_values(data, .l = replacement_tibble)
```

### Feature Engineering

```r
# Create ratio features
data <- data %>%
  mutate(ratio = feature_ratio(numerator, denominator))

# Calculate percent of total
data <- data %>%
  mutate(pct = percent_of_total(value))

# Apply pseudo-log transform (handles negative values)
data <- data %>%
  mutate(log_feature = pseudo_log(value, .sigma = 1))

# Create fractile bins
data <- data %>%
  mutate(risk_level = fractile(value, .levels = c("low", "medium", "high")))
```

### Date Manipulation

```r
# Apply month offset
data <- data %>%
  mutate(future_date = month_offset(date, .direction = "future", .num_months = 1))

# Time shift columns
data <- data %>%
  time_shift_column(
    .group_name = id,
    .date_column = date,
    .shift_columns = c(value, count),
    .num_periods = -1,
    .duration_fxn = lubridate::dweeks
  )
```

## Contributing

We welcome contributions to etlr! Here's how you can help:
- Fork the repository on GitHub
- Clone your fork locally
- Create a new branch for your feature (`git checkout -b feature/amazing-feature`)
- Make your changes and ensure tests pass
- Commit your changes with clear messages
- Push to your fork (`git push origin feature/amazing-feature`)
- Open a Pull Request
### Development Guidelines

- Follow R style guidelines and use `roxygen2` for documentation
- Add tests for new functionality in `tests/testthat/`
- Update documentation in the `man/` directory
- Keep functions focused and modular
- Use meaningful function and variable names
- Handle edge cases and provide informative error messages
### Running Tests

```r
# Run all tests
devtools::test()

# Run specific test file
testthat::test_file("tests/testthat/test-tercile.R")

# Check package
devtools::check()
```

### Code Style

- Use the magrittr pipe (`%>%`) for readability
- Prefix internal/helper functions with `.`
- Include comprehensive documentation with examples
- Follow existing patterns in the codebase
## Reporting Issues

If you encounter bugs or have suggestions:

- Check existing issues to avoid duplicates
- Create a new issue with:
  - A clear title and description
  - A reproducible example
  - Expected vs. actual behavior
  - Session info (`sessionInfo()`)
## License

This package is licensed under MIT + file LICENSE. See the LICENSE file for details.
## Author

- Krisoye Smith - Author and Maintainer
- Email: krisoye.work@gmail.com
## Acknowledgments

- Built with `dplyr`, `lubridate`, `arrow`, and other excellent R packages
- Inspired by the need for streamlined ETL workflows in data science
Current version: **0.2.0**

For more detailed documentation on specific functions, see:

- Package documentation: `help(package = "etlr")`
- Individual function help: `?function_name`
- GitHub repository: [Link to repository]
- Issue tracker: [Link to issues]
Happy Data Wrangling! 🚀