Multi-GPU Training

Currently, training on multiple GPUs can be cumbersome because each subprocess individually must normalize and compute rarity features and this problem becomes increasingly worse as the size of our dataset increases. We can have just the main process do these expensive operations and cache for subprocesses.