Home
- THE UNRELEASED VERSION 2.x ATTEMPT IS DISCONTINUED
The 2.x design explored a clean, composable pipeline architecture in which transformations and validations at every stage — headers, data fields, and hashes — were expressed as arrays of Procs. The API was intentionally beautiful: readable, extensible, and very much in the spirit of Ruby.
The problem is performance. A Proc-based pipeline that runs once per file (headers) is fine. But data and hash transformations run once per row, and field-level transformations run once per field per row. At scale — millions of rows, dozens of columns — every Ruby Proc call is a re-entry into the Ruby VM. The elegance of the abstraction directly undermined the performance of the library.
The better path turned out to be C-acceleration combined with declarative options. Instead of user-composable Proc chains, common transformations are handled at the C level, driven by simple option flags. The API still reads cleanly at the call site, but the runtime never pays the cost of evaluating Ruby lambdas in the hot path.
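The trade-off can be sketched in a few lines. This is an illustrative contrast only: the names `field_transforms`, `transform_field`, and the option flags below are hypothetical stand-ins, not the actual SmarterCSV 2.x or 1.x API.

```ruby
# 2.x style (hypothetical): a user-composed Proc pipeline.
# Each Proc call re-enters the Ruby VM once per field, for every row —
# with millions of rows and dozens of columns, this dominates runtime.
field_transforms = [
  ->(v) { v.strip },                # strip surrounding whitespace
  ->(v) { v.empty? ? nil : v },     # convert empty strings to nil
]

def transform_field(value, transforms)
  # Stop early if a transform returns nil.
  transforms.reduce(value) { |v, t| v && t.call(v) }
end

transform_field("  hello  ", field_transforms)  # => "hello"

# Declarative style (hypothetical option names): simple flags that a
# C extension can honor inside the parsing hot path, so no Ruby lambda
# is ever evaluated per field.
options = { strip_whitespace: true, remove_empty_values: true }
```

The call-site readability is similar, but only the first version pays a Ruby method dispatch for every field of every row.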
The 2.x work wasn't wasted — it helped clarify the right shape of the API, even if the implementation approach had to change. The lesson: keep the pipeline paradigm as a conceptual model for the user, but don't let it become the execution model.
OUTDATED CONTENT FOLLOWS:
SmarterCSV 2 is a Ruby Gem for smarter importing of CSV Files as Arrays of Hashes, suitable for parallel processing with Sidekiq or Resque, as well as direct processing of the resulting hashes with Rails, e.g. ActiveRecord or Mongoid.
SmarterCSV 2 is still an experimental pre-release; if you want the stable 1.x version of this gem, please see the README on the main branch.
The API of Ruby's standard CSV library is quite dated, and its processing of CSV files, which returns arrays of arrays, feels very close to the metal. The output is not easy to work with, especially if you want to create database records from it. Another shortcoming is that Ruby's CSV library has poor support for huge CSV files: there is no built-in 'chunking' and no support for parallel processing of the CSV content (e.g. with Sidekiq).
As the existing CSV libraries didn't fit my needs, I wrote my own CSV processing, specifically for use with Rails ORMs like ActiveRecord, Mongoid, or MongoMapper. With those ORMs you can simply pass a hash of attribute/value pairs to the create() method. The lower-level Mongo driver and Moped also accept arrays of such hashes, so a large number of records can be created with a single call.
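The pattern described above can be sketched with Ruby's standard CSV library. The `Model` class here is a minimal stand-in for an ORM with a `create()` method (it is not ActiveRecord or Mongoid), used only to show the hash-per-row shape:

```ruby
require 'csv'

# Stand-in for an ORM model: create() accepts a hash of
# attribute/value pairs, just like ActiveRecord or Mongoid.
class Model
  @records = []
  class << self
    attr_reader :records
    def create(attrs)
      @records << attrs
    end
  end
end

csv_data = "first_name,last_name\nAda,Lovelace\nGrace,Hopper\n"

# headers: true yields one row object per data line;
# header_converters: :symbol turns "first_name" into :first_name.
CSV.parse(csv_data, headers: true, header_converters: :symbol).each do |row|
  Model.create(row.to_h)  # e.g. { first_name: "Ada", last_name: "Lovelace" }
end

puts Model.records.size  # => 2
```

With hashes of this shape, each row maps directly onto a `Model.create(attrs)` call, or whole batches of them can be handed to a background job for parallel processing.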
- Customize any Transformations or Validations
- Error Handling
- Parallel Processing