The eval step reward is currently implemented as "np.mean(rewards)/steps", which looks like it is supposed to return the mean reward per step. Because numpy broadcasts the scalar mean against the steps array, however, this actually evaluates to an array equivalent to [np.mean(rewards)/s for s in steps], which is probably not what we want to log. Maybe we should just log the steps and the rewards separately and be done with it?
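A minimal sketch of the broadcasting behavior, assuming rewards and steps are per-episode lists collected during eval (the variable names and the suggested alternative are illustrative, not the actual implementation):

```python
import numpy as np

# Hypothetical per-episode eval results.
rewards = [1.0, 2.0, 3.0]   # total reward of each eval episode
steps = [10, 20, 30]        # length of each eval episode

# Current expression: the scalar mean reward is broadcast against the
# steps array, producing one value per episode instead of a single metric.
per_episode = np.mean(rewards) / np.array(steps)
print(per_episode)          # [0.2        0.1        0.06666667]

# If a single "reward per step" number were wanted, dividing total reward
# by total steps would be one option; otherwise just log both as-is.
reward_per_step = np.sum(rewards) / np.sum(steps)
print(reward_per_step)      # 0.1
```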