Are you requesting a feature or an implementation?
To handle partially observable MDP tasks, recurrent policies are currently quite popular. We need to add an LSTM layer after the original conv (or MLP) body and store the hidden states for training. But in SLM-Lab, the RecurrentNet class has limited abilities: it is more like a concatenation of a series of input states, and the RNN hidden states are not stored, which seriously weakens the recurrent policy.
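To illustrate what I mean, here is a minimal sketch (not SLM-Lab code; class and parameter names are hypothetical) of a recurrent policy whose LSTM hidden state is returned to the caller so it can be carried across rollout steps and stored for training, instead of being reset to zeros every forward pass:

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """MLP body + LSTM whose hidden state is threaded through forward()."""

    def __init__(self, obs_dim, hid_dim, act_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hid_dim), nn.ReLU())
        self.lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.head = nn.Linear(hid_dim, act_dim)

    def forward(self, obs, hx=None):
        # obs: (batch, seq_len, obs_dim); hx is the (h, c) tuple from the
        # previous call, or None at episode start (zeros).
        feat = self.body(obs)
        out, hx = self.lstm(feat, hx)
        return self.head(out), hx
```

During rollout, the agent would keep `hx` between environment steps and save it alongside each transition, so that training can re-initialize the LSTM from the stored state rather than from zeros.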
For example, I used it with the default parameters to solve the CartPole task, and it failed:
python run_lab.py slm_lab/spec/experimental/ppo/ppo_cartpole.json ppo_rnn_separate_cartpole train
Even when I changed the env's max_frame parameter from 500 to 50000, the RecurrentNet still could not solve it (note the grad_norm: nan and the learning rate collapsing to ~1e-37 in the logs):
[2019-07-14 21:11:38,098 PID:18904 INFO logger.py info] Session 1 done
[2019-07-14 21:11:38,287 PID:18903 INFO __init__.py log_metrics] Trial 0 session 0 ppo_rnn_separate_cartpole_t0_s0 [train_df metrics] final_return_ma: 58.26 strength: 35.4753 max_strength: 178.14 final_strength: 37.14 sample_efficiency: 9.07107e-05 training_efficiency: 6.71198e-06 stability: 0.846315
[2019-07-14 21:11:38,468 PID:18905 INFO __init__.py log_summary] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [train_df] epi: 647 t: 126 wall_t: 655 opt_step: 997120 frame: 49859 fps: 76.1206 total_reward: 126 total_reward_ma: 88.02 loss: 0.610099 lr: 1.44304e-37 explore_var: nan entropy_coef: 0.001 entropy: 0.0258675 grad_norm: nan
[2019-07-14 21:11:38,835 PID:18905 INFO __init__.py log_summary] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [train_df] epi: 648 t: 54 wall_t: 656 opt_step: 997760 frame: 49913 fps: 76.0869 total_reward: 54 total_reward_ma: 88.02 loss: 0.554544 lr: 1.44304e-37 explore_var: nan entropy_coef: 0.001 entropy: 0.217777 grad_norm: nan
[2019-07-14 21:11:38,835 PID:18906 INFO __init__.py log_metrics] Trial 0 session 3 ppo_rnn_separate_cartpole_t0_s3 [eval_df metrics] final_return_ma: 79.4461 strength: 57.5861 max_strength: 159.64 final_strength: 54.39 sample_efficiency: 9.59096e-05 training_efficiency: 4.81586e-06 stability: 0.899133
[2019-07-14 21:11:38,836 PID:18906 INFO logger.py info] Session 3 done
[2019-07-14 21:11:39,296 PID:18903 INFO __init__.py log_metrics] Trial 0 session 0 ppo_rnn_separate_cartpole_t0_s0 [eval_df metrics] final_return_ma: 61.299 strength: 39.439 max_strength: 178.14 final_strength: 32.64 sample_efficiency: 0.000120629 training_efficiency: 6.06361e-06 stability: 0.84144
[2019-07-14 21:11:39,794 PID:18905 INFO logger.py info] Running eval ckpt
[2019-07-14 21:11:39,939 PID:18905 INFO __init__.py log_summary] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [eval_df] epi: 649 t: 0 wall_t: 657 opt_step: 999680 frame: 50000 fps: 76.1035 total_reward: 84.25 total_reward_ma: 78.0294 loss: 2.42707 lr: 1.44304e-37 explore_var: nan entropy_coef: 0.001 entropy: 0.135592 grad_norm: nan
[2019-07-14 21:11:40,234 PID:18903 INFO __init__.py log_metrics] Trial 0 session 0 ppo_rnn_separate_cartpole_t0_s0 [eval_df metrics] final_return_ma: 61.299 strength: 39.439 max_strength: 178.14 final_strength: 32.64 sample_efficiency: 0.000120629 training_efficiency: 6.06361e-06 stability: 0.84144
[2019-07-14 21:11:40,236 PID:18903 INFO logger.py info] Session 0 done
[2019-07-14 21:11:41,480 PID:18905 INFO __init__.py log_metrics] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [train_df metrics] final_return_ma: 88.02 strength: 55.0476 max_strength: 178.14 final_strength: 32.14 sample_efficiency: 8.00063e-05 training_efficiency: 4.46721e-06 stability: 0.708828
[2019-07-14 21:11:42,347 PID:18905 INFO __init__.py log_metrics] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [eval_df metrics] final_return_ma: 78.0294 strength: 56.1694 max_strength: 84.39 final_strength: 62.39 sample_efficiency: 8.97979e-05 training_efficiency: 4.50698e-06 stability: 0.860915
[2019-07-14 21:11:43,242 PID:18905 INFO __init__.py log_metrics] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [eval_df metrics] final_return_ma: 78.0294 strength: 56.1694 max_strength: 84.39 final_strength: 62.39 sample_efficiency: 8.97979e-05 training_efficiency: 4.50698e-06 stability: 0.860915
[2019-07-14 21:11:43,243 PID:18905 INFO logger.py info] Session 2 done
[2019-07-14 21:11:49,818 PID:18839 INFO analysis.py analyze_trial] All trial data zipped to data/ppo_rnn_separate_cartpole_2019_07_14_210040.zip
[2019-07-14 21:11:49,818 PID:18839 INFO logger.py info] Trial 0 done
If you have any suggested solutions
I'm afraid of introducing more bugs, so unfortunately I'm not able to add this feature myself. But I can point to two reference implementations:
OpenAI baselines
pytorch-a2c-ppo-acktr-gail
With this feature, I believe SLM-Lab will be the top PyTorch RL library.
Thanks in advance!