About Me
Hi there! I am Hao Sun, a final-year Ph.D. student at the University of Cambridge, supervised by Prof. Mihaela van der Schaar and working at the intersection of reinforcement learning (RL) and large language models (LLMs). During my M.Phil. study at MMLab@CUHK, I was advised by Prof. Dahua Lin and Prof. Bolei Zhou. I hold a B.Sc. in Physics from Yuanpei College and a B.Sc. from the Guanghua School of Management, both at Peking University. My undergraduate thesis was supervised by Prof. Zhouchen Lin.
I am seeking full-time research positions starting in 2025.
Research
Research interests and motivations: My research focuses on RL and LLM Alignment (also referred to as post-training). RL is key to superhuman intelligence, and more powerful LLMs, optimized with RL, enable humans to learn from machine intelligence through natural language.
I am particularly proud of the following research works:
Large Language Model Alignment (Since 2023)
- Query-Dependent Prompt Evaluation and Optimization with Offline Inverse RL
ICLR 2024; also an Oral Presentation at the NeurIPS 2023 ENLSP Workshop
Hao Sun, Alihan Hüyük, Mihaela van der Schaar
- We studied inference-time optimization of large language models on mathematical reasoning tasks (a minimal sketch of the general recipe follows this entry).
- We highlighted the importance of reward models in LLM inference-time optimization for math, a topic the community has recognized as important since late 2024 (a year after the paper was finished).
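To make that recipe concrete, here is a minimal, hedged sketch of reward-model-guided inference-time selection (best-of-n); it is an illustration under my own naming, not the paper's exact algorithm, and `generate`, `reward_model`, and `candidate_prompts` are hypothetical placeholders.

```python
# Minimal sketch of reward-model-guided inference-time selection (best-of-n).
# `generate` and `reward_model` are hypothetical placeholders, not Prompt-OIRL's API.
from typing import Callable, List

def best_of_n(query: str,
              candidate_prompts: List[str],
              generate: Callable[[str, str], str],
              reward_model: Callable[[str, str], float]) -> str:
    """Generate one answer per candidate prompt, score each answer with a
    query-dependent reward model, and return the highest-scoring answer."""
    answers = [generate(prompt, query) for prompt in candidate_prompts]
    scores = [reward_model(query, answer) for answer in answers]
    return answers[scores.index(max(scores))]
```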
- Rethinking the Bradley-Terry Models in Preference-based Reward Modeling: Foundation, Theory, and its Alternatives
ICLR 2025 Oral (Top 1.2%)
Hao Sun*, Yunyi Shen*, Jean-Francois Ton (* denotes equal contribution)
- We studied the foundations of preference-based reinforcement learning from human feedback (RLHF) practices, answering the foundational question of why the Bradley-Terry model (recapped after this entry) is a solid, yet not necessary, choice in RLHF.
- We justified and pioneered the research direction of embedding-based reward modeling.
- Our follow-up works further developed this agenda by studying active reward modeling in RLHF and efficient personalized alignment, and by contributing computationally efficient infrastructure to the research community.
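For background (a standard recap, not a result specific to this paper): the Bradley-Terry model assumes the probability that answer $y_1$ is preferred over $y_2$ for a prompt $x$ is

$$
P(y_1 \succ y_2 \mid x) = \sigma\big(r(x, y_1) - r(x, y_2)\big) = \frac{\exp r(x, y_1)}{\exp r(x, y_1) + \exp r(x, y_2)},
$$

where $r$ is the reward model and $\sigma$ is the sigmoid; $r$ is then fit by maximizing the likelihood of observed preference pairs. The paper examines when this modeling choice is justified and what its alternatives are.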
Reinforcement Learning (Since 2018)
- Policy Continuation with Hindsight Inverse Dynamics
NeurIPS 2019 Spotlight (Top 2.4%)
Hao Sun, Zhizhong Li, Dahua Lin, Bolei Zhou
- We introduced the first self-imitation-learning algorithm for multi-goal RL.
- Our paper pioneered the research field of supervised-learning-based goal-conditioned RL (see the sketch after this entry). This agenda has been further developed since 2021 by works from UC Berkeley (Paper 1 and Paper 2).
- Our follow-up work extended this idea to general continuous-control settings.
- Our follow-up work published at ICLR 2022 connects this idea with offline RL.
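As a rough illustration of the supervised-learning-based goal-conditioned RL recipe this line of work builds on (not the exact PCHID algorithm), here is a minimal sketch; `policy`, `optimizer`, and `loss_fn` are assumed PyTorch-style placeholders of my own.

```python
# Minimal sketch of hindsight relabeling + supervised imitation for
# goal-conditioned RL (an illustration, not the exact PCHID algorithm).
def hindsight_imitation_step(trajectory, policy, optimizer, loss_fn):
    """Relabel states reached later in the trajectory as goals, then fit the
    goal-conditioned policy by supervised learning on (state, goal) -> action."""
    for t, (state, action) in enumerate(trajectory):
        for future_state, _ in trajectory[t + 1:]:
            # Imitate the action that actually led toward the future state.
            loss = loss_fn(policy(state, future_state), action)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```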
- Exploiting Reward Shifting in Value-based Deep RL
NeurIPS 2022
Hao Sun, Lei Han, Rui Yang, Xiaoteng Ma, Bolei Zhou
- We provided new insights into the fundamental exploration-exploitation trade-off through reward shifting (see the note after this entry).
- This line of research has been revisited and highlighted by research from Prof. Richard Sutton's group at RLC 2024.
- The method has been widely verified in RL applications such as offline RL (IJCAI'23), robotic locomotion (IROS'24), optimistic initialization (AAMAS'24), and multi-agent exploration (recent preprint).
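For intuition (my own summary in standard terms, not the paper's exact statement): adding a constant $c$ to every reward shifts all discounted action values uniformly,

$$
Q^{\pi}_{r+c}(s, a) = Q^{\pi}_{r}(s, a) + \frac{c}{1-\gamma},
$$

so with the usual near-zero network initialization, a negative shift makes the initial value estimates relatively optimistic (encouraging exploration), while a positive shift makes them relatively pessimistic (supporting conservative exploitation, e.g., in offline RL).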
News!
(2025.04) I'll attend ICLR 2025 in person.
(2025.03) Guest lecture on Inverse RL Meets LLMs at the UCLA Reinforcement Learning course.
(2025.02) Attending AAAI 2025 to run the Tutorial: Inverse RL Meets LLMs. Thanks for joining us in Philadelphia! Slides.
(2025.02) Our Reward Model Paper Part IV: Multi-Objective and Personalized Alignment with PCA is online.
(2025.02) Our Reward Model Paper Part III: Infrastructure for Reproducible Reward Model Research is online.
(2025.02) Our Reward Model Paper Part II: Active Reward Modeling is online.
(2025.01) Our Reward Model Paper Part I: Foundation, Theory, and Alternatives is accepted by ICLR as an Oral. It was an amazing experience to co-lead this paper with Yunyi, advised by Jef.
(2024.12) We will run the Tutorial: Inverse RL Meets LLMs at ACL 2025; see you in Vienna!
(2024.10) New talk on Inverse RL Meets LLMs at the vdsLab2024 OpenHouse and the UCLA Zhou Lab. Slides are online.
(2024.09) Our Data-Centric Reward Modeling paper is accepted by the Journal of Data-Centric Machine Learning Research (DMLR).
(2024.08) InverseRLignment was presented at the RL Beyond Reward workshop (accepted with a score of 9) at the first Reinforcement Learning Conference (RLC); it builds reward models from SFT data.
(2024.05) Our RLHF with Dense Reward paper is accepted by ICML 2024.
(2024.03) Prompt-OIRL and RATP were featured at the Inspiration Exchange; the recording is online.
(2024.01) One RL + LLM reasoning paper is accepted by ICLR 2024! Prompt-OIRL uses inverse RL to evaluate and optimize prompts for math reasoning.
(2024.01) Invited talk on RLHF at the Intuit AI Research Forum. Slides.
(2023.12) Invited talk on RLHF at the Likelihood Lab. Slides.
(2023.11) Invited talk on RLHF at the CoAI group, THU. Slides.
(2023.10) Prompt-OIRL is selected as an oral presentation at the NeurIPS 2023 ENLSP workshop!
(2023.10) I wrote an article to share my thoughts as an RL researcher in the Era of LLMs.
(2023.09) Two papers, on Interpretable Offline RL and Interpretable Uncertainty Quantification, are accepted by NeurIPS 2023.