Critical Step Optimization (CSO) is a novel post-training approach for LLM agents that focuses preference learning on verified critical steps where alternative actions demonstrably change task outcomes from failure to success. Unlike trajectory-level methods that apply coarse rewards uniformly, CSO provides fine-grained, step-specific supervision grounded in verified outcomes.
- 🎯 Verified Critical Steps: Only optimize on steps where alternative actions provably flip task outcomes
- 🔍 PRM-Guided Identification: Use Process Reward Models to efficiently identify candidate critical steps
- ✅ Outcome Verification: Ground supervision in verified task success through branch rollouts
- 📊 Efficient Supervision: Achieve superior performance with supervision at only ~16% of trajectory steps
- 🚀 Strong Performance: 37% and 26% relative gains over the SFT baseline on GAIA and XBench-DeepSearch, respectively
- Collect Failed Trajectories: Run policy model on tasks, collect failed attempts
- Identify Critical Steps: Use PRM to score alternative actions at each step
- Branch & Verify: Replace policy action with high-scoring alternative and rollout to completion
- Generate Preferences: Create verified DPO pairs only when alternative leads to success
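The branch-and-verify decision at the heart of this workflow can be sketched in a few lines. Everything below (the `Candidate` record, the field names, the example scores) is illustrative, not the repository's actual API; the real pipeline lives in `scripts/`:

```python
# Minimal, self-contained sketch of the CSO preference-pair rule.
# Field names and threshold values are illustrative assumptions.
from dataclasses import dataclass

GOOD_MIN, BAD_MAX = 0.6, 0.5  # PRM-score thresholds (see scripts/data_processing/)

@dataclass
class Candidate:
    context: str          # trajectory prefix up to this step
    policy_action: str    # the action the failed trajectory actually took
    policy_score: float   # PRM score of the original action
    alt_action: str       # high-scoring alternative sampled at this step
    alt_score: float      # PRM score of the alternative
    branch_success: bool  # did rolling out the alternative solve the task?

def to_dpo_pair(c: Candidate):
    """Emit a verified pair only when (a) the branch rollout actually
    succeeded and (b) the PRM scores separate the two actions."""
    if c.branch_success and c.alt_score >= GOOD_MIN and c.policy_score <= BAD_MAX:
        return {"prompt": c.context, "chosen": c.alt_action, "rejected": c.policy_action}
    return None  # not a verified critical step

cand = Candidate("...steps so far...", "click(wrong_link)", 0.3,
                 "search('GAIA benchmark')", 0.8, branch_success=True)
print(to_dpo_pair(cand) is not None)  # True: verified critical step
```

Note that a high PRM score alone is not enough: the pair is kept only if the branch rollout verifiably succeeds, which is what distinguishes CSO from unverified PRM-based preference generation.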
CSO/
├── system/ckv3/ # Core CK system implementation
│ ├── agents/ # Agent modules, reward models
│ ├── ck_main/ # Main CK system
│ ├── ck_web/ # Web environment support
│ └── ck_file/ # File processing utilities
│
└── scripts/
├── inference/ # Inference scripts
│ ├── run_gaia_with_planning_no_prm_l1_v0_ck.sh # Policy model baseline
│ ├── run_gaia_with_planning_no_prm_l1_v0_ck_dpo.sh # CSO-trained model
│ ├── run_resample_prm.sh # PRM resampling
│ └── run_continue_from_checkpoint.sh # Checkpoint continuation
│
└── data_processing/ # Data generation
├── convert_prm_to_llamafactory_dpo.py # Standard DPO (no verification)
├── verify_and_generate_dpo.py # ✨ Verified CSO DPO
└── run_verify_and_generate_dpo.sh # ✨ CSO pipeline
This project is based on Cognitive Kernel-Pro with custom modifications for CSO.
# Clone repository
git clone https://github.com/kiaia/CSO.git
cd CSO
# Install dependencies
pip install playwright anthropic openai azure-identity flask flask-cors
playwright install chromium
# Set Python path
export PYTHONPATH="${PYTHONPATH}:$(pwd)"

See SETUP_GUIDE.md for detailed installation instructions.
# Configure your API keys
export AZURE_OPENAI_API_KEY="your_key"
export AZURE_OPENAI_ENDPOINT="your_endpoint"
# Or for OpenAI
export OPENAI_API_KEY="your_key"

First, run your policy model (base model before CSO training) on your task dataset:
cd scripts/inference
# Configure your policy model endpoint (vLLM served)
export LLM_URL="http://your-policy-model:8081/v1/chat/completions"
# Run policy model to collect trajectories
bash run_gaia_with_planning_no_prm_l1_v0_ck.sh

Input: Task dataset with queries and ground truth answers
Output: Agent trajectories (both successful and failed)
Note: We focus on failed trajectories for CSO - these contain the critical decision points we want to optimize.
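Selecting the failed trajectories amounts to comparing each final answer against the ground truth. A minimal filter might look like the following; the field names ("task_id", "final_answer") and the exact-match scoring are assumptions for illustration, not the repository's actual trajectory schema:

```python
# Sketch: keep only the failed trajectories for Step 2. The field names
# and exact-match scoring are illustrative assumptions.
def failed_only(trajectories, ground_truth):
    """Return trajectories whose final answer misses the ground truth."""
    return [t for t in trajectories
            if t["final_answer"].strip().lower()
               != ground_truth[t["task_id"]].strip().lower()]

trajs = [{"task_id": "t1", "final_answer": "Paris"},   # correct -> dropped
         {"task_id": "t2", "final_answer": "42"}]      # wrong   -> kept
gt = {"t1": "paris", "t2": "43"}
print([t["task_id"] for t in failed_only(trajs, gt)])  # ['t2']
```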
For each step in failed trajectories, generate alternative actions and score them with PRM:
bash run_resample_prm.sh

Input: Failed trajectories from Step 1
Output: Trajectories with PRM scores for alternative actions at each step
Execute high-scoring branches to verify they lead to correct answers:
cd ../data_processing
bash run_verify_and_generate_dpo.sh

Input: PRM-resampled trajectories
Output: Verified DPO pairs in LLaMAFactory format
Key thresholds (configurable in script):
- GOOD_MIN=0.6: Minimum PRM score for the chosen action
- BAD_MAX=0.5: Maximum PRM score for the rejected action
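For training, LLaMAFactory needs the verified pairs registered under the dataset name used below (verified_cso_dpo) in its data/dataset_info.json. A minimal entry might look like this; the file and column names are assumptions based on LLaMAFactory's alpaca-style preference format, so check the converter script for the exact schema it emits:

```json
{
  "verified_cso_dpo": {
    "file_name": "verified_cso_dpo.json",
    "ranking": true,
    "columns": {
      "prompt": "instruction",
      "chosen": "chosen",
      "rejected": "rejected"
    }
  }
}
```

The "ranking": true flag tells LLaMAFactory to treat the file as preference data with chosen/rejected responses rather than plain SFT examples.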
Train your model using the verified DPO data with LLaMAFactory:
llamafactory-cli train \
--stage dpo \
--model_name_or_path your_base_model \
--dataset verified_cso_dpo \
--output_dir output/cso_model \
--per_device_train_batch_size 4 \
--learning_rate 5e-6 \
--num_train_epochs 3

Deploy your trained model using vLLM:
# Serve your CSO-trained model
vllm serve your_trained_model \
--port 8081 \
--max-model-len 32000

Evaluate the CSO-trained model:
cd ../inference
# Configure to use your trained model endpoint
export LLM_URL="http://localhost:8081/v1/chat/completions"
bash run_gaia_with_planning_no_prm_l1_v0_ck_dpo.sh

If you use this code or find our work helpful, please cite:
@misc{li2026verifiedcriticalstepoptimization,
title={Verified Critical Step Optimization for LLM Agents},
author={Mukai Li and Qingcheng Zeng and Tianqing Fang and Zhenwen Liang and Linfeng Song and Qi Liu and Haitao Mi and Dong Yu},
year={2026},
eprint={2602.03412},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2602.03412},
}

- Cognitive Kernel-Pro: GitHub | Paper
- DeepVerifier: Paper
- Agent Self-Evolving Research: WebEvolver, WebCoT, WebVoyager, OpenWebVoyager, WebAggregatorQA.
This project is licensed under the Apache License 2.0.
This work builds upon Cognitive Kernel-Pro from Tencent AI Lab. We've made modifications for CSO - please install from this repository for compatibility.
