The eval step reward is currently implemented as "np.mean(rewards)/steps", which looks like it is supposed to return the mean reward per step. Because numpy broadcasts the scalar mean against the steps array, however, this actually evaluates to an array equivalent to [np.mean(rewards)/s for s in steps], which is probably not what we want to log. Maybe we should just log the steps and the rewards separately and be done with it?
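A minimal sketch of the broadcasting behavior, assuming rewards and steps are per-episode lists collected during eval (the variable names and the suggested alternative are illustrative, not the actual implementation):

```python
import numpy as np

# Hypothetical per-episode eval results.
rewards = [1.0, 2.0, 3.0]   # total reward of each eval episode
steps = [10, 20, 30]        # length of each eval episode

# Current expression: the scalar mean reward is broadcast against the
# steps array, producing one value per episode instead of a single metric.
per_episode = np.mean(rewards) / np.array(steps)
print(per_episode)          # [0.2        0.1        0.06666667]

# If a single "reward per step" number were wanted, dividing total reward
# by total steps would be one option; otherwise just log both as-is.
reward_per_step = np.sum(rewards) / np.sum(steps)
print(reward_per_step)      # 0.1
```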