My Research
My full publication list can be found on my Google Scholar profile.
Recent Research Highlights
Since June 2023, my research has centered on leveraging LLMs as a key use case for RL, particularly from an Inverse RL perspective. My recent work focuses on improving the general capabilities of LLMs through advanced reward modeling and alignment. Some key insights and contributions include:
- Necessity of Alignment in Any Application of LLMs: Any use of LLMs can be significantly enhanced through reward modeling and alignment. Without such models, LLMs function only as universal samplers; integrating a reward model enables optimization and search at inference time, e.g., best-of-N selection (see the sketch after this list).
- Reward Modeling from an Inverse RL Lens: My work addresses both data and model aspects of reward modeling from an Inverse RL lens:
- Prompt-OIRL for query-dependent prompt evaluation and optimization to improve reasoning,
- InverseRLignment for building reward models from demonstration data,
- DataCOPE for evaluating the data used in reward modeling and the reliability of reward models,
- ABC for addressing credit assignment via dense rewards, and
- RATP for modeling LLMs’ thought processes as MDPs and optimizing them with MCTS guided by reward models.
- Order-Consistency in Reward Modeling: We recently developed an order-consistency framework for reward modeling in alignment. It includes the first asymptotic theory justifying the use of both Bradley-Terry models and classifiers (the pairwise and pointwise objectives sketched after this list), supported by large-scale experiments spanning over 100,000 runs. (Paper and code to be released soon.)
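
To make the two reward-modeling objectives above concrete, here is a minimal PyTorch sketch contrasting the pairwise Bradley-Terry loss with a pointwise classifier loss. The tensor names are illustrative, and this is not the code accompanying the order-consistency paper.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise objective: maximize P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def pointwise_classifier_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Pointwise alternative: score each response independently as desirable (1) or not (0)."""
    return F.binary_cross_entropy_with_logits(scores, labels)

# Toy usage with random scalar scores standing in for a reward-model head.
r_chosen, r_rejected = torch.randn(8), torch.randn(8)
scores, labels = torch.randn(8), torch.randint(0, 2, (8,)).float()
print(bradley_terry_loss(r_chosen, r_rejected).item(),
      pointwise_classifier_loss(scores, labels).item())
```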
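
And here is a minimal sketch of the inference-time search that a reward model unlocks, in the form of best-of-N selection. The `generate` and `reward_model` callables are hypothetical placeholders for any LLM sampler and scalar scorer, not a specific implementation from the works above.

```python
from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],              # hypothetical LLM sampler
              reward_model: Callable[[str, str], float],   # hypothetical scalar scorer
              n: int = 16) -> str:
    """Sample n candidate responses and return the one the reward model scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward_model(prompt, response))
```

Larger n trades extra sampling compute for higher reward under the model, at the risk of over-optimizing an imperfect reward.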
Research Philosophy
- I am equally passionate about the scientific-discovery and engineering aspects of research, and I believe great research must clearly separate and achieve both kinds of progress. Philosophically, I view science as a process of denoising: uncovering the minimal rules that explain complex observations. Finding the most minimalist approach that effectively solves a practical problem brings me great fulfillment. One of my favorite films is The Theory of Everything, and it is an honor to pursue my PhD at DAMTP, Cambridge, where that story took place.
- Several contributions in my research journey reflect this philosophy: I introduced self-imitation as a strong control method (PCHID); showed that Q-learning can be highly efficient for continuous control (ZOSPI); showed that early termination and recurrent networks suffice to solve constrained MDPs (ETMDP); proposed linear reward shifting as a simple yet powerful technique for either exploration or exploitation, in both online and offline RL (RewardShifting; a minimal sketch follows this list); and used tree-based reward models to streamline reward modeling research, offering high flexibility and efficient ensembling without heavy memory usage.
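
As a minimal sketch of the linear reward shifting idea, assuming the Gymnasium API (this is an illustration, not the original RewardShifting implementation):

```python
import gymnasium as gym

class RewardShiftWrapper(gym.RewardWrapper):
    """Add a constant b to every reward: r'(s, a) = r(s, a) + b.

    With discount gamma, a constant shift moves all values by b / (1 - gamma),
    which acts like biasing the value initialization: a negative shift behaves
    conservatively (useful for exploitation and offline RL), while a positive
    shift behaves optimistically (useful for exploration).
    """

    def __init__(self, env: gym.Env, shift: float):
        super().__init__(env)
        self.shift = shift

    def reward(self, reward: float) -> float:
        return reward + self.shift

# Example: a pessimistic shift on a standard control task (env name illustrative).
env = RewardShiftWrapper(gym.make("Pendulum-v1"), shift=-1.0)
```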
Research Keywords
🤖️ My research focuses on Reinforcement Learning, a fundamental path toward Superhuman Intelligence. Applications of my work span robotics🦾, healthcare💉, finance📈, and large language models🧠. Some of my research keywords include:
- (Inverse) RL in Language Models (2023-); Inverse RL (2021-); Interpretable RL (2023-);
- Uncertainty Quantification (2022-); Off-Policy Evaluation and Reward Modeling (2022-);
- Value-Based Deep-RL (2021-); Offline RL (2021-); Optimism in Exploration (2021-);
- Continuous Control via Supervised Learning (2020-); Goal-Conditioned RL (2020-);
- RL for Robotics and Control (2019-)