Dr. Xiaotian Hao's Homepage

News

2023.12.09, 3 papers were accepted by AAAI-2024!

(a) Multiagent Gumbel MuZero: Efficient Planning in Combinatorial Action Spaces
(b) PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning
(c) Designing Biological Sequences without Prior Knowledge using Evolutionary Reinforcement Learning

Give a talk at DAI 2023

On December 1, 2023, I was invited to deliver a presentation at the Workshop on Multi-Agent Systems in Complex Environments during the DAI 2023 conference.

An enjoyable journey to Singapore. After the pandemic, it's my first time going abroad for a conference, and I'm fortunate to be invited to give a presentation at the DAI workshop! At the beautiful Nanyang Technological University, there are many cutting-edge viewpoints on MARL, large decision models, agents, robot learning, and open environment RL colliding here. However, the rain in Singapore is a bit too frequent, haha.

2023.07，DataFunTalk社区，分享主题为《多智能体强化学习大模型初探》 的学术报告。

2023.01，One paper on multiagent learning accepted by ICLR 2023.

2022.10，I was awarded Huawei Excellent Intern (Top 3/200).

Work Experience

华为-诺亚方舟实验室-决策推理研究团队-Research Intern（2022.01-2023.07）

盘古大模型RLHF优化项目： (1)负责Reward Model设计与训练，基于ChatGLM LoRA，DeepSpeed搭建训练框架，针对产品线标注数据集，输出测试精度89%的模型。(2) 提出基于Stochastic Beam Search实现指数空间高效无放回top-k采样，提升样本质量&多样性，提高RAFT效率。(3) 整理RLHF技术博客。

高维优化项目： 针对多智能体系统中状态-动作空间高维难以优化问题，提出结合置换不变性与置换同变性的归纳偏置设计算法，在星际争霸等基准测试环境中取得SOTA性能，研究成果发表在ICLR 2023（第一作者），开源代码库pymarl3。

华为-诺亚方舟实验室-企业智能团队-Algorithm Engineer（2020.06-2021.06）

自研大规模线性规划求解器项目： 提出基于最大堆的pricing加速算法，预估变量价值构造最大堆实现高效top-1（入基）检索。在华为供应链多工厂排产等核心业务落地，千万变量线性规划求解时间提速一倍以上，已集成在华为云天筹AI求解器对外发布，该技术被评为华为潜高等级专利，专利公开号：CN115496247A（Idea的提出人和实现者）。

Dynamic Pickup and Delivery Problems (DPDP)优化项目： 针对复杂的DPDP组合优化问题，提出基于RL的分层搜索算法，上层负责将在线动态问题切分为静态问题（学习最优切分方式），下层负责静态问题求解，在松山湖制造投递运输系统落地测试，订单超时减少42%，车辆行驶距离减少23%，研究成果发表在NeurIPS 2021（共同第一作者）。

阿里巴巴集团-阿里妈妈-精准广告团队-Research Intern（2018.10-2019.10）

多渠道联合的广告序列化投放项目： 广告精排阶段，研究广告不同投放组合对用户未来行为及价值的影响，建模为动态背包问题（物品价值和成本受投放策略的影响），采用Bilevel优化，并证明最优性。线上A/B test广告主ROI +25%，双十一期间20%流量，ROI +11%，研究成果发表在ICML-2020（第一作者）。

手淘Banner钻石展位，多智能体锦囊优化项目： 广告召回阶段，广告主之间的博弈优化，learning算法加速求解大规模b-matching问题，离线测试大盘RPM提升9%，研究成果发表在IJCAI-2020（第一作者）。

Selected Publications

Multiagent Gumbel MuZero: Efficient Planning in Combinatorial Action Spaces (AAAI 2024).

Xiaotian Hao, Jianye Hao, Chenjun Xiao, Kai Li, Dong Li.

AlphaZero and MuZero have achieved state-of-the-art (SOTA) performance in a wide range of domains, including board games and robotics, with discrete and continuous action spaces. However, to obtain an improved policy, they often require an excessively large number of simulations, especially for domains with large action spaces. As the simulation budget decreases, their performance drops significantly. In addition, many important real-world applications have combinatorial (or exponential) action spaces, making it infeasible to search directly over all possible actions. In this paper, we extend AlphaZero and MuZero to learn and plan in more complex multiagent Markov decision processes, where the action spaces increase exponentially with the number of agents. Our new algorithms, Multiagent Gumbel AlphaZero and Multiagent Gumbel MuZero, respectively without and with model learning, achieve SOTA performance on typical cooperative multiagent control problems and more challenging StarCraft II benchmarks, while reducing the number of environmental interactions by up to an order of magnitude compared to SOTA model-free approaches. In particular, we significantly improve prior performance when planning with much fewer simulation budgets. We will open-source the code sooner, hoping to accelerate the research of MCTS-based algorithms in wider communities.

AAAI 2024. [Paper] [Code]

Boosting Multiagent Reinforcement Learning via Permutation Invariant and Permutation Equivariant Networks (ICLR 2023).

Xiaotian Hao, Jianye Hao, Hangyu Mao, Weixun Wang, Yaodong Yang, Dong Li, Yan Zheng

The state space in Multiagent Reinforcement Learning (MARL) grows exponentially with the agent number. Such a curse of dimensionality results in poor scalability and low sample efficiency, inhibiting MARL for decades. To break this curse, we propose a unified agent permutation framework that exploits the permutation invariance (PI) and permutation equivariance (PE) inductive biases to reduce the multiagent state space. Our insight is that permuting the order of entities in the factored multiagent state space does not change the information. Specifically, we propose two novel implementations: a Dynamic Permutation Network (DPN) and a Hyper Policy Network (HPN). The core idea is to build separate entity-wise PI input and PE output network modules to connect the entity-factored state space and action space in an end-to-end way. DPN achieves such connections by two separate module selection networks, which consistently assign the same input module to the same input entity (guarantee PI) and assign the same output module to the same entity-related output (guarantee PE). To enhance the representation capability, HPN replaces the module selection networks of DPN with hypernetworks to directly generate the corresponding module weights. Extensive experiments in SMAC, SMACv2, Google Research Football, and MPE validate that the proposed methods significantly boost the performance and the learning efficiency of existing MARL algorithms. Remarkably, in SMAC, we achieve 100% win rates in almost all hard and super-hard scenarios (never achieved before).

ICLR 2023. [Paper] [Code]

A Hierarchical Reinforcement Learning Based Optimization Framework for Large-scale Dynamic Pickup and Delivery Problems (NeurIPS 2021, CCF-A).

Yi Ma*, Xiaotian Hao*, Jianye Hao, Jiawen Lu, Mingxuan Yuan, Zhaopeng Meng

The Dynamic Pickup and Delivery Problem (DPDP) is an essential problem in the logistics domain, which is NP-hard. The objective is to dynamically schedule vehicles among multiple sites to serve the online generated orders such that the overall transportation cost could be minimized. The critical challenge of DPDP is the orders are not known a priori, ie, the orders are dynamically generated in real-time. To address this problem, existing methods partition the overall DPDP into fixed-size sub-problems by caching online generated orders and solve each sub-problem, or on this basis to utilize the predicted future orders to optimize each sub-problem further. However, the solution quality and efficiency of these methods are unsatisfactory, especially when the problem scale is very large. In this paper, we propose a novel hierarchical optimization framework to better solve large-scale DPDPs. Specifically, we design an upper-level agent to dynamically partition the DPDP into a series of sub-problems with different scales to optimize vehicles routes towards globally better solutions. Besides, a lower-level agent is designed to efficiently solve each sub-problem by incorporating the strengths of classical operational research-based methods with reinforcement learning-based policies. To verify the effectiveness of the proposed framework, real historical data is collected from the order dispatching system of Huawei Supply Chain Business Unit and used to build a functional simulator. Extensive offline simulation and online testing conducted on the industrial order dispatching system justify the superior performance of our framework over existing baselines.

NeurIPS 2021. [Paper]

Dynamic Knapsack Optimization Towards Efficient Multi-Channel Sequential Advertising (ICML 2020, CCF-A).

Xiaotian Hao, Zhaoqing Peng, Yi Ma, Junqi Jin, Jianye Hao, Rongquan Bai, Chuan Yu, Han Li, Jian Xu, Kun Gai.

In E-commerce, advertising is essential for merchants to reach their target users. The typical objective is to maximize the advertiser’s cumulative revenue over a period of time under a budget constraint. In real applications, an advertisement (ad) usually needs to be exposed to the same user multiple times until the user finally contributes revenue (eg, places an order). However, existing advertising systems mainly focus on the immediate revenue with single ad exposures, ignoring the contribution of each exposure to the final conversion, thus usually falls into suboptimal solutions. In this paper, we formulate the sequential advertising strategy optimization as a dynamic knapsack problem. We propose a theoretically guaranteed bilevel optimization framework, which significantly reduces the solution space of the original optimization space while ensuring the solution quality. To improve the exploration efficiency of reinforcement learning, we also devise an effective action space reduction approach. Extensive offline and online experiments show the superior performance of our approaches over state-of-the-art baselines in terms of cumulative revenue.

ICML 2020. [Paper] [Code]

Learning to Accelerate Heuristic Searching for Large-Scale Maximum Weighted b-Matching Problems in Online Advertising (IJCAI 2020, CCF-A).

Xiaotian Hao, Junqi Jin, Jianye Hao, Jin Li, Weixun Wang, Han Li, Jian Xu, Kun Gai.

Bipartite b-matching is fundamental in algorithm design, and has been widely applied into economic markets, labor markets, etc. These practical problems usually exhibit two distinct features: large-scale and dynamic, which requires the matching algorithm to be repeatedly executed at regular intervals. However, existing exact and approximate algorithms usually fail in such settings due to either requiring intolerable running time or too much computation resource. To address this issue, we propose NeuSearcher which leverages the knowledge learned from previously instances to solve new problem instances. Specifically, we design a multichannel graph neural network to predict the threshold of the matched edges weights, by which the search region could be significantly reduced. We further propose a parallel heuristic search algorithm to iteratively improve the solution quality until convergence. Experiments on both open and industrial datasets demonstrate that NeuSearcher can speed up 2 to 3 times while achieving exactly the same matching solution compared with the state-of-the-art approximation approaches.

ICML 2020. [Paper]

Independent Generative Adversarial Self-Imitation Learning in Cooperative Multiagent Systems (AAMAS 2019, CCF-B).

Xiaotian Hao, Weixun Wang, Jianye Hao, Yaodong Yang.

Many tasks in practice require the collaboration of multiple agents through reinforcement learning. In general, cooperative multiagent reinforcement learning algorithms can be classified into two paradigms: Joint Action Learners (JALs) and Independent Learners (ILs). In many practical applications, agents are unable to observe other agents' actions and rewards, making JALs inapplicable. In this work, we focus on independent learning paradigm in which each agent makes decisions based on its local observations only. However, learning is challenging in independent settings due to the local viewpoints of all agents, which perceive the world as a non-stationary environment due to the concurrently exploring teammates. In this paper, we propose a novel framework called Independent Generative Adversarial Self-Imitation Learning (IGASIL) to address the coordination problems in fully cooperative multiagent environments. To the best of our knowledge, we are the first to combine self-imitation learning with generative adversarial imitation learning (GAIL) and apply it to cooperative multiagent systems. Besides, we put forward a Sub-Curriculum Experience Replay mechanism to pick out the past beneficial experiences as much as possible and accelerate the self-imitation learning process. Evaluations conducted in the testbed of StarCraft unit micromanagement and a commonly adopted benchmark show that our IGASIL produces state-of-the-art results and even outperforms JALs in terms of both convergence speed and final performance..

ICML 2020. [Paper] [Code]

Selected Patents

基于最大堆的线性规划 pricing 加速算法. 李希君（华为实习导师），郝晓田（idea的提出者和实现者），袁明轩，郝建业，曾嘉. 公开号：CN115496247A

Selected Competitions & Awards

2021年11月：NeurIPS-2020-MineRL国际竞赛冠军

2021年12月：NeurIPS-2022 SMARTS自动驾驶国际竞赛最佳参赛奖

2022年8月：互联网 + 创新创业大赛天津赛区银奖

2022.10：华为优秀实习生 (Top 3/200)

2019.10：研究生国家奖学金 (Top 2/100)

| Links to my mentors and friends

Weixun Wang（王维埙）, Yi Ma（马亿）, Hongyao Tang（汤宏垚）, Yaodong Yang（杨耀东）, Fei Ni（倪飞）, Yihai Duan（段义海）, Hangyu Mao（毛航宇）, Yan Zheng（郑岩）, Jianye Hao（郝建业副教授）