👋 Hi, I'm Yantao Liu, currently a Researcher at Qwen.
I co-lead research and development in reward modeling, and also work on reinforcement learning training for large language models (LLMs).
Before graduation, I was a research intern at Z.AI and THU-KEG, where I worked closely with Zijun Yao, Yixin Cao, and Juanzi Li.
For my full list of publications, please refer to my Google Scholar profile.
At Qwen, we are always looking for talented interns, researchers, and engineers to join us! If you are interested (especially in reward modeling or RLHF), please feel free to reach out to me.
- Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models - Proposes Rationale Consistency to mitigate deceptive alignment in GenRMs, achieving SOTA on RM-Bench and improving RLHF performance.
- ToolRM: Towards Agentic Tool-Use Reward Modeling - A family of lightweight reward models tailored for tool-use scenarios, achieving up to 17.94% higher accuracy on tool calling tasks.
- PairJudge-RM - A pairwise reward model using knockout tournaments to improve Best-of-N sampling for LLMs.
- RM-Bench - A benchmark that tests reward models on subtle content differences and style bias resistance to better align language models.
If you're interested in reward modeling or any of my publications, feel free to reach me at my personal email!
- Qwen3.5: Towards Native Multimodal Agents
  📰 News: Top 3 Open Model in Arena (Jan 2026)
  💼 Role: Core Contributor (Lead for the RLHF part of LLM post-training)
- Qwen3-Max-Instruct: Just Scale it
  📰 News: #4 Model in Expert Arena (Nov 2025), #6 Model Among All in Arena (Sep 2025)
  💼 Role: Core Contributor (Lead for the RLHF part of LLM post-training)
- Qwen3-Max-Thinking: Pushing Qwen3-Max-Thinking Beyond its Limits
  💼 Role: Core Contributor (Lead for the RLHF part of LLM post-training)
- Qwen3-2507: A Better Qwen3 for Everyone
  📰 News: #1 Open Model (#3 Among All) in Arena (Aug 2025)
  💼 Role: Core Contributor (Lead for the RLHF part of LLM post-training)
- Qwen3-Next: Towards Ultimate Training & Inference Efficiency
  💼 Role: Core Contributor (Lead for the RLHF part of LLM post-training)
- Qwen3: Think Deeper, Act Faster
  💼 Role: Contributor (Reward Modeling & RLHF)
- M.S. Student
  University of the Chinese Academy of Sciences
  2022 - 2025
- B.S. Student
  Beijing University of Posts and Telecommunications
  2018 - 2022
