etlr

A Variety of R ETL Features

Description

etlr is a comprehensive R package designed to streamline Extract, Transform, and Load (ETL) operations for data science and analytics workflows. It provides a rich set of tools for data transformation, file management, cloud storage integration, and feature engineering that simplify common data processing tasks.

The package is built on top of popular R packages like dplyr, lubridate, and magrittr, providing a cohesive interface for data manipulation tasks. Whether you're working with local files, cloud storage (S3), or need to transform data for machine learning pipelines, etlr offers the functionality you need.

Features

🔄 Data Transformation & Feature Engineering

  • Quantile Functions: Create quantile-based features with tercile(), quintile(), fractile(), and quantile_column()
  • Date Filtering: Intelligent date filtering with filter_date() supporting month-end and week-end filtering
  • Date Manipulation: Advanced date operations including month_offset() and time_shift_column()
  • Feature Engineering: Create ratios (feature_ratio()), calculate percentages of totals (percent_of_total()), and apply pseudo-logarithmic transformations (pseudo_log(), pseudo_exp())
  • Column Manipulation: Flexible column mutation with mutate_columns() supporting type conversions and transformations
  • Value Replacement: Batch replace column values with replace_column_values()

📁 File Operations & Serialization

  • Directory Serialization: Consolidate multiple files into a single dataframe with serialize_directory() and consolidate_files()
  • File Conversion: Convert files between formats (CSV, Feather, Parquet, etc.) with convert_files_in_folder()
  • File Loading/Storing: Unified interface with do_load() and do_store() supporting multiple backends (readr, arrow, vroom, readxl)
  • Join Files: Horizontally join multiple files with load_join_files()
  • Hive Path Support: Extract hive-style partition columns and open datasets with extract_hive_columns() and open_from_hive_path()

☁️ Cloud Storage Integration

  • S3 Operations: Comprehensive S3 support with cache_from_s3(), cache_to_s3(), cache_file_from_s3(), and cache_file_to_s3()
  • S3 Directory Sync: Synchronize directories with s3_sync_directory()
  • S3 File Operations: Copy files and list directories with s3_copy_file() and s3_list_directory()
  • Safe Operations: safely_cache_to_s3() provides error handling for cloud operations
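
The Usage section below demonstrates the cache functions but not the sync or safe variants. A brief sketch may help; the argument names here are assumptions, following the `.bucket`/`.key`/`.profile` convention of the other S3 helpers, and should be checked against the function documentation:

```r
library(etlr)

# Mirror a local results directory into an S3 prefix
# (assumed signature, matching the other S3 helpers)
s3_sync_directory(
  .source_dir = "output/",
  .bucket = "my-output-bucket",
  .key = "results/",
  .profile = "my-aws-profile"
)

# safely_cache_to_s3() wraps cache_to_s3() so a failed upload is
# captured as an error result instead of aborting the pipeline
result <- safely_cache_to_s3(
  .object = processed_data,
  .bucket = "my-output-bucket",
  .key = "results/results.feather",
  .profile = "my-aws-profile"
)
```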

🛠️ Utilities & Helpers

  • Column Management: Add columns (add_column(), add_additional_columns()), append quarters (append_quarter()), and convert types (integer_as_character())
  • Data Sampling: Quick subsampling with quick_filter()
  • Utilities: Helper functions like when_null() for NULL handling, lsos() for listing in-memory objects by size, and create_temp_dir() for creating temporary directories
  • CLI Alerts: Built-in notification system with cli_alert()
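
These helpers have no examples in the Usage section, so here is a hypothetical sketch based only on the names and descriptions above; verify the actual signatures with `?when_null` and friends before relying on them:

```r
library(etlr)

# when_null(): fall back to a default when a value is NULL
# (assumed semantics, analogous to rlang's %||% operator)
threshold <- when_null(config$threshold, 0.5)

# quick_filter(): take a quick subsample of a large dataframe
# for interactive exploration (argument names illustrative)
preview <- quick_filter(big_df)

# create_temp_dir(): create and return a fresh temporary directory
tmp_dir <- create_temp_dir()

# cli_alert(): emit a formatted console notification
cli_alert("Preview written to temporary directory")
```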

Installation

To install the development version of etlr from GitHub:

# Install devtools if you haven't already
install.packages("devtools")

# Install etlr
devtools::install_github("krisoye/etlr")

Alternatively, if you have the package source files:

# Install from local source
devtools::install(".", dependencies = TRUE)

System Requirements

  • R (>= 3.6.0)
  • Required packages: assertthat, dplyr, lubridate, magrittr
  • Optional but recommended: arrow, readr, readxl, vroom, janitor, cli
  • For S3 functionality: AWS CLI must be installed and configured
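
Because the S3 helpers rely on the AWS CLI, the profile you pass via `.profile` must be configured before use. For example (bucket and profile names are placeholders):

```shell
# Verify the AWS CLI is installed, then configure a named profile
aws --version
aws configure --profile my-aws-profile

# Sanity-check that the profile can reach your bucket
aws s3 ls s3://my-data-bucket/ --profile my-aws-profile
```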

Build from Source

# Clone the repository
git clone https://github.com/krisoye/etlr.git
cd etlr

# Build and install
R CMD build .
R CMD INSTALL etlr_0.2.0.tar.gz

Usage

Basic Data Transformation

library(etlr)
library(dplyr)

# Create tercile bins
x <- runif(100)
terciles <- tercile(x, low_breakpoint = 0.3, high_breakpoint = 0.7)
table(terciles)

# Filter to month-end dates only
data <- tibble(timestamp_utc = seq(as.Date("2024-01-01"), as.Date("2024-12-31"), by = "day"))
month_end_data <- filter_date(data, .filter_date_column = "timestamp_utc", .filter_eom = TRUE)

# Create quantile features
data <- data %>%
  mutate(quantile_feature = quantile_column(ID3, .num_quantiles = 5))

File Operations

# Serialize multiple files into one dataframe
consolidated <- serialize_directory(
  .source_dir = "path/to/data",
  .target_file = "output.feather",
  .load_fxn = readr::read_delim,
  .store_fxn = arrow::write_feather
)

# Convert files from CSV to Feather format
convert_files_in_folder(
  .source_dir = "input_csv/",
  .target_dir = "output_feather/",
  .old_extension = ".csv",
  .new_extension = ".feather",
  .load_fxn = readr::read_csv,
  .store_fxn = arrow::write_feather
)

# Load files with type conversion
data <- serialize_directory(
  .source_dir = "input/",
  .char_cols = c("id", "category"),
  .date_cols = c("date", "created_at"),
  .integer_cols = c("count", "quantity")
)

S3 Integration

# Fetch data from S3
data <- cache_from_s3(
  .bucket = "my-data-bucket",
  .key = "path/to/data.parquet",
  .profile = "my-aws-profile",
  .load_fxn = arrow::read_parquet
)

# Upload data to S3
cache_file_to_s3(
  .object = processed_data,
  .bucket = "my-output-bucket",
  .key = "results/results.feather",
  .profile = "my-aws-profile",
  .store_fxn = arrow::write_feather
)

# List S3 directory contents
s3_files <- s3_list_directory(
  .bucket = "my-data-bucket",
  .key = "data/",
  .profile = "my-aws-profile",
  .recursive = TRUE
)

Advanced Transformations

# Join multiple files
joined_data <- load_join_files(
  .source_file_path = "data/",
  .join_by = c("id", "date"),
  .load_fxn = readr::read_csv,
  .join_fxn = dplyr::left_join
)

# Extract hive partition columns and load
data <- open_from_hive_path(
  .datapath = "s3://bucket/year=2024/month=01/day=15/",
  .load_fxn = arrow::open_dataset
)

# Apply batch column value replacements
replacement_tibble <- tibble(
  colname = c("status", "status"),
  pattern = c("A", "B"),
  replacement = c("Active", "Inactive")
)
data <- replace_column_values(data, .l = replacement_tibble)

Feature Engineering

# Create ratio features
data <- data %>%
  mutate(ratio = feature_ratio(numerator, denominator))

# Calculate percent of total
data <- data %>%
  mutate(pct = percent_of_total(value))

# Apply pseudo-log transform (handles negative values)
data <- data %>%
  mutate(log_feature = pseudo_log(value, .sigma = 1))

# Create fractile bins
data <- data %>%
  mutate(risk_level = fractile(value, .levels = c("low", "medium", "high")))

Date Operations

# Apply month offset
data <- data %>%
  mutate(future_date = month_offset(date, .direction = "future", .num_months = 1))

# Time shift columns
data <- data %>%
  time_shift_column(
    .group_name = id,
    .date_column = date,
    .shift_columns = c(value, count),
    .num_periods = -1,
    .duration_fxn = lubridate::dweeks
  )

Contributing

We welcome contributions to etlr! Here's how you can help:

Getting Started

  1. Fork the repository on GitHub
  2. Clone your fork locally
  3. Create a new branch for your feature (git checkout -b feature/amazing-feature)
  4. Make your changes and ensure tests pass
  5. Commit your changes with clear messages
  6. Push to your fork (git push origin feature/amazing-feature)
  7. Open a Pull Request

Development Guidelines

  • Follow R style guidelines and use roxygen2 for documentation
  • Add tests for new functionality in tests/testthat/
  • Update documentation in man/ directory
  • Keep functions focused and modular
  • Use meaningful function and variable names
  • Handle edge cases and provide informative error messages
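
For example, a new exported function would pair roxygen2 comments with a test in tests/testthat/. The function below is purely illustrative (its body assumes the existing quantile_column() helper):

```r
#' Create decile bins for a numeric vector
#'
#' @param .x A numeric vector.
#' @param .num_quantiles Number of quantile bins; defaults to 10.
#' @return A vector of bin assignments.
#' @examples
#' decile(runif(100))
#' @export
decile <- function(.x, .num_quantiles = 10) {
  quantile_column(.x, .num_quantiles = .num_quantiles)
}
```

Run devtools::document() to regenerate the man/ pages from these comments rather than editing them by hand.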

Testing

# Run all tests
devtools::test()

# Run specific test file
testthat::test_file("tests/testthat/test-tercile.R")

# Check package
devtools::check()

Code Style

  • Use the magrittr pipe (%>%) for readability, consistent with the examples above
  • Prefix internal/helper functions with .
  • Include comprehensive documentation with examples
  • Follow existing patterns in the codebase

Reporting Issues

If you encounter bugs or have suggestions:

  1. Check existing issues to avoid duplicates
  2. Create a new issue with:
    • Clear title and description
    • Reproducible example
    • Expected vs actual behavior
    • Session info (sessionInfo())

License

This package is licensed under the MIT License. See the LICENSE file for details.

Author

Acknowledgments

  • Built with dplyr, lubridate, arrow, and other excellent R packages
  • Inspired by the need for streamlined ETL workflows in data science

Version

Current version: 0.2.0

Additional Resources

For more detailed documentation on specific functions, see:

  • Package documentation: help(package = "etlr")
  • Individual function help: ?function_name
  • GitHub repository: [Link to repository]
  • Issue tracker: [Link to issues]

Happy Data Wrangling! 🚀
