Empirical study examining whether market regime information improves reinforcement learning agents for optimal trade execution in simulated limit order book markets. Built on the CTMSTOU simulation environment from Amrouni et al. (2022) - JP Morgan AI Research.
The optimal execution problem: an institutional trader must buy a large quantity of shares at minimal cost before a deadline. Markets shift between two regimes, bullish (rising prices) and bearish (falling prices), and a good trader behaves differently in each.
This study asks whether an RL agent can learn this regime-conditional behavior automatically. We conduct a controlled empirical study evaluating whether PPO-based agents can exploit regime information when introduced via state augmentation or reward conditioning, using multi-seed experiments and ablation analysis.
This work provides controlled empirical evidence that flat RL approaches fail to learn qualitatively correct regime-dependent execution behavior, motivating hierarchical formulations.
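For context, a minimal sketch of the two conditioning mechanisms studied, state augmentation and reward conditioning. The function names, observation layout, and the specific reward scaling below are illustrative assumptions, not the repository's implementation:

```python
import numpy as np

def augment_state(base_obs: np.ndarray, regime: int) -> np.ndarray:
    """State augmentation: append the current regime (0 = bearish, 1 = bullish)
    as an extra observation feature."""
    return np.concatenate([base_obs, [float(regime)]])

def condition_reward(implementation_shortfall: float, regime: int) -> float:
    """Reward conditioning: scale the execution-cost penalty by regime,
    e.g. penalize waiting more heavily when prices are rising (bullish).
    The 1.5 / 0.5 scaling is purely illustrative."""
    penalty_scale = 1.5 if regime == 1 else 0.5
    return -penalty_scale * implementation_shortfall
```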
- Can a learned RL policy conditioned on market regime match hand-coded regime-aware rules?
- Is regime information in the state sufficient, or is reward conditioning also needed?
- Why does flat RL fail to exploit regime information even when it has access to it?
- RL agents achieve near-perfect order completion (1.000) vs TWAP (0.850)
- Neither state augmentation nor reward conditioning matches the hand-coded rule on cost (WAP 1.0003 vs 0.9950)
- The regime-aware agent exhibits highly polarized behavior across regimes, often deviating from the qualitatively optimal strategy.
- Regime sensitivity is initialization-dependent: individual seeds can show extreme sensitivity (action 0.92 → 0.00 on a regime flip) while the multi-seed average shows near-zero sensitivity (see the probe sketch after this list)
- Reward conditioning introduces training instability (WAP std 0.0131) without gains
- Hyperparameter sensitivity analysis confirms the gap persists regardless of training budget, suggesting the limitation is structural rather than due to insufficient training
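The regime-flip sensitivity reported above can be probed by comparing the policy's action on paired observations that differ only in the regime flag. A minimal sketch, assuming the regime flag is the last observation feature and `model` is a trained Stable-Baselines3 PPO policy (both assumptions):

```python
import numpy as np

def regime_sensitivity(model, obs: np.ndarray) -> float:
    """Absolute change in the deterministic action when only the regime flag
    of an otherwise identical observation is flipped."""
    flipped = obs.copy()
    flipped[-1] = 1.0 - flipped[-1]                  # toggle bull/bear flag
    a_orig, _ = model.predict(obs, deterministic=True)
    a_flip, _ = model.predict(flipped, deterministic=True)
    return float(np.abs(a_orig - a_flip).max())
```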
| Strategy | WAP (mean ± std) | Completion |
|---|---|---|
| TWAP | 1.0278 ± 0.000 | 0.850 |
| Full Market Order | 1.0278 ± 0.000 | 1.000 |
| Regime Aware Rule | 0.9950 ± 0.000 | 0.996 |
| PPO Blind | 1.0003 ± 0.0000 | 1.000 |
| PPO State-Aware | 1.0004 ± 0.0001 | 1.000 |
| PPO Reward-Conditioned | 1.0069 ± 0.0131 | 0.996 |
WAP is normalized to the starting price; lower is better. Values below 1.0 mean the agent bought cheaper than the opening price. Standard deviations are computed across 5 independent training seeds.
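As a reference for the metric, a small sketch of how normalized WAP might be computed from a list of fills; the function name and fill arrays are illustrative, not the repository's evaluation code:

```python
import numpy as np

def normalized_wap(fill_prices, fill_sizes, opening_price):
    """Weighted average execution price divided by the opening price.
    Values below 1.0 mean the order was filled cheaper than the open."""
    fill_prices = np.asarray(fill_prices, dtype=float)
    fill_sizes = np.asarray(fill_sizes, dtype=float)
    wap = np.sum(fill_prices * fill_sizes) / np.sum(fill_sizes)
    return wap / opening_price

# Example: two equal fills at 99.5 and 100.5 against an open of 100.0 -> 1.0
print(normalized_wap([99.5, 100.5], [1.0, 1.0], 100.0))
```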
```
├── curves/              # Learning curves
├── figures/             # All paper figures
├── src/                 # All Python files
│   ├── baselines.py     # TWAP, Full MO, Regime-Aware rule baselines
│   ├── ctmstou.py       # CTMSTOU market simulator
│   ├── environment.py   # Gymnasium execution environment
│   ├── plot_results.py  # Figure generation
│   └── train.py         # PPO training + multi-seed evaluation
└── README.md
```
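To illustrate the simplest of the rule-based baselines in baselines.py, a sketch of a TWAP child-order schedule; the parent-order size, step count, and function name are assumptions for illustration, not the repository's exact code:

```python
import numpy as np

def twap_schedule(total_shares: int, n_steps: int) -> np.ndarray:
    """Time-Weighted Average Price baseline: split the parent order into
    (nearly) equal child orders, one per decision step."""
    base = total_shares // n_steps
    schedule = np.full(n_steps, base, dtype=int)
    schedule[: total_shares - base * n_steps] += 1   # distribute the remainder
    return schedule

# e.g. 1000 shares over 12 steps -> [84, 84, 84, 84, 83, ..., 83]
print(twap_schedule(1000, 12))
```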
```bash
conda create -n regime-exec python=3.9
conda activate regime-exec
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install stable-baselines3 gymnasium numpy matplotlib
```

Expected runtime: several hours depending on CPU (multi-seed PPO training is the main cost).
```bash
cd src
python baselines.py      # rule-based baselines (~1 min)
python train.py          # all RL agents, 5 seeds each (several hours)
python plot_results.py   # generate all figures
```

Pretrained model checkpoints are not included. All results are generated via controlled multi-seed experiments using the provided training pipeline, ensuring full reproducibility without reliance on fixed model artifacts.
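For orientation, a sketch of what the multi-seed PPO training loop in train.py might look like; the `ExecutionEnv` class name, its constructor, the timestep budget, and the hyperparameters are assumptions, not the repository's actual settings:

```python
from stable_baselines3 import PPO
from environment import ExecutionEnv  # class name assumed

# Train one PPO agent per random seed and save each checkpoint.
for seed in range(5):
    env = ExecutionEnv()
    model = PPO("MlpPolicy", env, seed=seed, verbose=0)
    model.learn(total_timesteps=200_000)
    model.save(f"ppo_state_aware_seed{seed}")
```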
If you use this code, please cite:
- Code (Zenodo), DOI: https://doi.org/10.5281/zenodo.19441357
- Preprint: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6559598
- Amrouni et al. (2022) - CTMSTOU driven markets (JP Morgan AI Research)
- Schulman et al. (2017) - Proximal Policy Optimization
- Stable Baselines 3