Are you requesting a feature or an implementation?
To handle partially observable MDP tasks, recurrent policies are currently quite popular. We need to add an LSTM layer after the original conv (or MLP) body and store the hidden states for training. But in SLM-Lab, the RecurrentNet class has limited abilities: it is more like a concatenation of a series of input states, and the RNN hidden states are not stored, which seriously weakens the recurrent policy.
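To illustrate what I mean, here is a minimal sketch (not SLM-Lab code; class and parameter names are hypothetical) of a recurrent policy whose LSTM hidden state is returned to the caller so it can be carried across rollout steps and stored for training, instead of being reset to zeros every forward pass:

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """MLP body + LSTM whose hidden state is threaded through forward()."""

    def __init__(self, obs_dim, hid_dim, act_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hid_dim), nn.ReLU())
        self.lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.head = nn.Linear(hid_dim, act_dim)

    def forward(self, obs, hx=None):
        # obs: (batch, seq_len, obs_dim); hx is the (h, c) tuple from the
        # previous call, or None at episode start (zeros).
        feat = self.body(obs)
        out, hx = self.lstm(feat, hx)
        return self.head(out), hx
```

During rollout, the agent would keep `hx` between environment steps and save it alongside each transition, so that training can re-initialize the LSTM from the stored state rather than from zeros.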
For example, I used it with the default parameters to solve the CartPole task, and it failed:
python run_lab.py slm_lab/spec/experimental/ppo/ppo_cartpole.json ppo_rnn_separate_cartpole train
Even when I changed the env's max_frame parameter from 500 to 50000, the RecurrentNet still could not solve it (note the grad_norm: nan and the learning rate collapsing to ~1e-37 in the logs):
[2019-07-14 21:11:38,098 PID:18904 INFO logger.py info] Session 1 done
[2019-07-14 21:11:38,287 PID:18903 INFO __init__.py log_metrics] Trial 0 session 0 ppo_rnn_separate_cartpole_t0_s0 [train_df metrics] final_return_ma: 58.26 strength: 35.4753 max_strength: 178.14 final_strength: 37.14 sample_efficiency: 9.07107e-05 training_efficiency: 6.71198e-06 stability: 0.846315
[2019-07-14 21:11:38,468 PID:18905 INFO __init__.py log_summary] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [train_df] epi: 647 t: 126 wall_t: 655 opt_step: 997120 frame: 49859 fps: 76.1206 total_reward: 126 total_reward_ma: 88.02 loss: 0.610099 lr: 1.44304e-37 explore_var: nan entropy_coef: 0.001 entropy: 0.0258675 grad_norm: nan
[2019-07-14 21:11:38,835 PID:18905 INFO __init__.py log_summary] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [train_df] epi: 648 t: 54 wall_t: 656 opt_step: 997760 frame: 49913 fps: 76.0869 total_reward: 54 total_reward_ma: 88.02 loss: 0.554544 lr: 1.44304e-37 explore_var: nan entropy_coef: 0.001 entropy: 0.217777 grad_norm: nan
[2019-07-14 21:11:38,835 PID:18906 INFO __init__.py log_metrics] Trial 0 session 3 ppo_rnn_separate_cartpole_t0_s3 [eval_df metrics] final_return_ma: 79.4461 strength: 57.5861 max_strength: 159.64 final_strength: 54.39 sample_efficiency: 9.59096e-05 training_efficiency: 4.81586e-06 stability: 0.899133
[2019-07-14 21:11:38,836 PID:18906 INFO logger.py info] Session 3 done
[2019-07-14 21:11:39,296 PID:18903 INFO __init__.py log_metrics] Trial 0 session 0 ppo_rnn_separate_cartpole_t0_s0 [eval_df metrics] final_return_ma: 61.299 strength: 39.439 max_strength: 178.14 final_strength: 32.64 sample_efficiency: 0.000120629 training_efficiency: 6.06361e-06 stability: 0.84144
[2019-07-14 21:11:39,794 PID:18905 INFO logger.py info] Running eval ckpt
[2019-07-14 21:11:39,939 PID:18905 INFO __init__.py log_summary] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [eval_df] epi: 649 t: 0 wall_t: 657 opt_step: 999680 frame: 50000 fps: 76.1035 total_reward: 84.25 total_reward_ma: 78.0294 loss: 2.42707 lr: 1.44304e-37 explore_var: nan entropy_coef: 0.001 entropy: 0.135592 grad_norm: nan
[2019-07-14 21:11:40,234 PID:18903 INFO __init__.py log_metrics] Trial 0 session 0 ppo_rnn_separate_cartpole_t0_s0 [eval_df metrics] final_return_ma: 61.299 strength: 39.439 max_strength: 178.14 final_strength: 32.64 sample_efficiency: 0.000120629 training_efficiency: 6.06361e-06 stability: 0.84144
[2019-07-14 21:11:40,236 PID:18903 INFO logger.py info] Session 0 done
[2019-07-14 21:11:41,480 PID:18905 INFO __init__.py log_metrics] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [train_df metrics] final_return_ma: 88.02 strength: 55.0476 max_strength: 178.14 final_strength: 32.14 sample_efficiency: 8.00063e-05 training_efficiency: 4.46721e-06 stability: 0.708828
[2019-07-14 21:11:42,347 PID:18905 INFO __init__.py log_metrics] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [eval_df metrics] final_return_ma: 78.0294 strength: 56.1694 max_strength: 84.39 final_strength: 62.39 sample_efficiency: 8.97979e-05 training_efficiency: 4.50698e-06 stability: 0.860915
[2019-07-14 21:11:43,242 PID:18905 INFO __init__.py log_metrics] Trial 0 session 2 ppo_rnn_separate_cartpole_t0_s2 [eval_df metrics] final_return_ma: 78.0294 strength: 56.1694 max_strength: 84.39 final_strength: 62.39 sample_efficiency: 8.97979e-05 training_efficiency: 4.50698e-06 stability: 0.860915
[2019-07-14 21:11:43,243 PID:18905 INFO logger.py info] Session 2 done
[2019-07-14 21:11:49,818 PID:18839 INFO analysis.py analyze_trial] All trial data zipped to data/ppo_rnn_separate_cartpole_2019_07_14_210040.zip
[2019-07-14 21:11:49,818 PID:18839 INFO logger.py info] Trial 0 done
If you have any suggested solutions
I'm afraid of introducing more bugs, so unfortunately I'm not able to add this feature myself. But I can point to two reference implementations:
OpenAI baselines
pytorch-a2c-ppo-acktr-gail
With this feature, I believe SLM-Lab will be the top PyTorch RL library.
Thanks in advance!