With the development of large model technology and the publication of the Scaling Law in 2020, it has become an industry consensus that model performance can be improved by expanding data scale and increasing model parameters. However, today's large models face many engineering challenges across training, inference, and application. Simply enlarging a model can further degrade training and inference efficiency and increase deployment complexity. In this context, the Mixture of Experts (MoE) model, with its efficient architecture, is becoming one of the important directions for large model development.
An MoE large model consists of a routing (or gating) network and several expert networks; the router dynamically selects the most suitable experts for each input to achieve higher efficiency. In training, the MoE architecture can accelerate model convergence: thanks to its sparsity and efficient gating mechanism, it can significantly improve training efficiency. For example, under the same computational budget, Google's MoE model Switch Transformer reaches the same accuracy as T5 about 7 times faster, while its model size is 17 times that of T5[1]. In inference, the MoE architecture can substantially increase speed: only part of the experts are activated, experts can run in parallel, and parts of the computation can be shared, which allows significant speedups. For example, DeepSeekMoE 16B performs comparably to LLaMA2-7B, while its inference speed is about 2.5 times higher[2]. In addition, the MoE architecture offers better flexibility and scalability: through adaptive routing and the addition or removal of experts, it supports faster multi-task learning and multi-modal integration across a range of application scenarios. For instance, Tencent and Jietu Star have launched trillion-parameter multi-modal large models based on MoE to better accommodate different modalities and tasks.
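To make the routing-plus-experts structure concrete, here is a minimal sketch of a top-k gated MoE layer in PyTorch. The layer sizes, the choice of two experts per token, and the per-expert loop are illustrative assumptions rather than any of the production implementations cited in this article; real systems dispatch tokens to experts with batched, parallel kernels.

    # Minimal top-k gated MoE layer (illustrative sketch, not a production implementation).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyMoELayer(nn.Module):
        def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.gate = nn.Linear(d_model, n_experts)  # routing (gating) network
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):  # x: (tokens, d_model)
            scores = F.softmax(self.gate(x), dim=-1)             # per-token expert scores
            topk_scores, topk_idx = scores.topk(self.top_k, -1)  # keep only the top-k experts
            out = torch.zeros_like(x)
            for slot in range(self.top_k):                       # only the selected experts run
                idx = topk_idx[:, slot]
                for e, expert in enumerate(self.experts):
                    mask = idx == e
                    if mask.any():
                        out[mask] += topk_scores[mask, slot].unsqueeze(-1) * expert(x[mask])
            return out

    tokens = torch.randn(16, 512)
    print(TinyMoELayer()(tokens).shape)  # torch.Size([16, 512])

Only the experts selected by the router are evaluated for each token, which is the source of the training- and inference-time savings described above.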
Despite these advantages, the engineering practice of MoE large models is still in its early stages and faces several challenges. In the training phase, MoE models are generally large and place higher demands on distributed training technology: software tools need to support expert parallelism on top of traditional data parallelism, tensor parallelism, and pipeline parallelism, and to optimize data scheduling strategies, communication software stacks, and so on[3]. In the inference phase, on one hand, compared with the global compression applied to dense large models, MoE compression strategies must pay more attention to sparsity and modularity; software tools need to support expert aggregation, pruning of rarely activated experts, expert-group quantization, and other compression techniques. On the other hand, distributed deployment also requires careful design around sparsity: software tools must support expert-parallel deployment strategies and place experts across nodes with expert correlations, activation frequencies, and other factors in mind, so as to further reduce communication costs[4]. Moreover, because MoE activates experts automatically based on the input data, expert loads can become imbalanced: frequently activated experts may overload their compute nodes and become bottlenecks, while long-idle experts waste resources, affecting system stability. MoE large models therefore call for closer co-optimization of models and systems.
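As a concrete illustration of the load-imbalance problem, the sketch below computes the auxiliary load-balancing loss used by Switch Transformer[1], which is added to the task loss so that gradient updates push the router toward a uniform expert load. The variable names and shapes here are assumptions for illustration, not any particular framework's API.

    # Switch-Transformer-style auxiliary load-balancing loss[1] (illustrative sketch).
    import torch
    import torch.nn.functional as F

    def load_balance_loss(gate_logits, top1_idx, n_experts):
        # gate_logits: (tokens, n_experts) raw router outputs
        # top1_idx:    (tokens,) index of the expert each token was dispatched to
        probs = F.softmax(gate_logits, dim=-1)
        # f: fraction of tokens actually routed to each expert
        f = torch.bincount(top1_idx, minlength=n_experts).float() / top1_idx.numel()
        # p: mean router probability assigned to each expert
        p = probs.mean(dim=0)
        # minimized when both distributions are uniform, i.e. the load is balanced
        return n_experts * torch.sum(f * p)

    logits = torch.randn(1024, 8)
    print(load_balance_loss(logits, logits.argmax(dim=-1), n_experts=8))  # ~1.0 when balanced

In practice this term is scaled by a small coefficient and combined with system-level measures such as capacity limits and expert placement, which is part of the model-system co-optimization mentioned above.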
As momentum behind MoE large models builds, large tech companies are launching software tools to lower the barrier to development and application. At the framework level, mainstream large model frameworks are gradually adding MoE support: Microsoft DeepSpeed[8] and NVIDIA Megatron-LM[9] have both added MoE-specific optimizations, and a batch of AI frameworks focused on MoE training and inference has emerged, such as ZhiYuan (BAAI) FastMoE[5] and Baidu SE-MoE[3][6]. At the platform level, co-optimizing systems and models to deploy MoE large models efficiently has become a new direction. For example, by co-optimizing its self-developed trillion-parameter MoE model with the Angel platform (including AngelPTM, AngelHCF, etc.), Tencent improved training performance by 108% and reduced cost by 70%, while overall inference performance doubled and inference cost fell by 50%[7].
The MoE architecture has become a key direction in the large model field and is gradually moving from research to engineering implementation, and from model-centric work to productized tool platforms. The AI Infra working group of the China Artificial Intelligence Industry Development Alliance (AIIA) has previously focused on large model infrastructure (training platforms, inference platforms, computing resource management platforms, etc.), and will next carry out work on the engineering practice of MoE large models, including standards formulation, report writing, and ecosystem salons. All parties are welcome to participate and jointly promote the high-quality development of the artificial intelligence industry.
Contact Persons
Teacher Yu 15650761587
Teacher Dong 15910462421
References:
[1] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity https://doi.org/10.48550/arXiv.2101.03961
[2] DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models https://arxiv.org/html/2401.06066v1
[3] SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System https://ar5iv.labs.arxiv.org/html/2205.10034
[4] MOE-INFINITY: An Inference Framework Optimized for MoE Large Model Deployment https://developer.volcengine.com/articles/7390576997803491340
[5] FastMoE https://github.com/laekov/fastmoe
[6] In-depth Analysis of SE-MoE: Baidu’s Leading Scalable Distributed MoE Training and Inference Framework https://developer.baidu.com/article/detail.html?id=3323614
[7] Tencent’s Latest Trillion-Parameter Heterogeneous MoE Goes Live, Technical Details Exposed for the First Time! Authoritative Assessment Ranks First in China, Close to GPT-4o https://mp.weixin.qq.com/s/cBtHBBIsk7qq2WFmome6rA
[8] DeepSpeed https://www.deepspeed.ai/
[9] Megatron-LM https://github.com/NVIDIA/Megatron-LM
[10] Mixture of Experts Explained https://huggingface.co/blog/moe
-END-
AIIA Secretariat Contact Persons
Teacher Gu
Teacher Huang