Exploring DeepSeek and Its Core Technologies

Editor's note from the Alibaba tech team: this article delves into the core technologies of the DeepSeek large model, providing a comprehensive analysis spanning the company's background, model capabilities, and training and inference costs, down to the details of the core technologies. 1. About the DeepSeek Company and Its Large Model 1.1 Company Overview DeepSeek was established in July 2023 in … Read more

Opportunities and Challenges of MoE Large Model Training and Inference

With the development of large-model technology and the proposal of the Scaling Law in 2020, improving model performance by expanding data scale and increasing model parameters has become an industry consensus. However, current large models face many engineering challenges across the training, inference, and application stages. Simply increasing the model size … Read more

Analysis of Tongyi Qwen 2.5-Max Model

1. Qwen 2.5-Max Model Overview 1.1 Model Introduction Alibaba Cloud officially launched Tongyi Qwen 2.5-Max on January 29, 2025. It is a large-scale Mixture-of-Experts (MoE) model that demonstrates exceptional performance and potential in the field of natural language processing. As an important member of the Qwen series, Qwen 2.5-Max stands out in comparison … Read more

Key Details of Qwen MoE: Enhancing Model Performance Through Global Load Balancing

Today we share the latest paper from the Alibaba Cloud Tongyi Qianwen team, Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models (original paper: https://arxiv.org/abs/2501.11873). The paper improves the training of Mixture-of-Experts (MoE) models by relaxing local load balance to global balance through lightweight communication, significantly … Read more
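To make the core idea concrete, here is a minimal PyTorch sketch (not the paper's code) of an auxiliary load-balancing loss in which the expert-selection fractions are optionally all-reduced across data-parallel ranks, relaxing per-micro-batch balance to batch-wide balance; the function name, tensor shapes, and distributed setup are assumptions for illustration.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def load_balancing_loss(router_probs, expert_indices, num_experts, global_balance=False):
    """Auxiliary balance loss: sum over experts of (mean gate probability)
    times (fraction of tokens routed to that expert), scaled by num_experts.

    router_probs:   (num_tokens, num_experts) softmax outputs of the router
    expert_indices: (num_tokens,) or (num_tokens, top_k) chosen expert ids
    """
    one_hot = F.one_hot(expert_indices.flatten(), num_experts).float()
    tokens_per_expert = one_hot.mean(dim=0)    # local routing fractions
    mean_probs = router_probs.mean(dim=0)      # local mean gate probabilities

    if global_balance and dist.is_initialized():
        # Lightweight communication: only a num_experts-sized vector is synced,
        # so balance is enforced over the whole global batch rather than per
        # micro-batch, leaving room for experts to specialize on local data.
        dist.all_reduce(tokens_per_expert, op=dist.ReduceOp.SUM)
        tokens_per_expert /= dist.get_world_size()

    return num_experts * torch.sum(mean_probs * tokens_per_expert)
```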

Qwen’s Year-End Gift: Enhancing MoE Training Efficiency

Today, we will learn about a powerful technology … Read more

Qwen Series Technical Interpretation 3 – Architecture

Sparse shadows slant across shallow water; a faint fragrance drifts beneath the dusk moon. Hello everyone, I am the little girl selling hot dry noodles, and I am delighted to share cutting-edge technology and thinking in the field of artificial intelligence with you. Following the previous installments in this series: Qwen Series … Read more

Mamba Evolution Disrupts Transformer: A100 Achieves 140K Context

New Intelligence report; editor: Editorial Department. [Guide] The production-grade Mamba model with 52B parameters is here! This powerful variant, Jamba, has just broken records: it competes directly with Transformers, offers a 256K ultra-long context window and a threefold increase in throughput, and its weights are available for free download. The Mamba architecture, which … Read more

Getting Started with Mistral: An Introduction

The open-source Mixtral 8x7B model launched by Mistral adopts a Mixture-of-Experts (MoE) architecture. Unlike a traditional Transformer, the MoE model incorporates multiple expert feed-forward networks (this model has 8), and during inference a gating network selects two experts to process each token. This setup allows MoE to … Read more
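As a rough illustration of the routing described above, here is a minimal PyTorch-style sketch of an MoE feed-forward layer with 8 experts and top-2 gating; the class name and dimensions are illustrative and not Mistral's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sketch of an MoE feed-forward layer: a gating network scores all
    experts and routes each token to the two highest-scoring ones."""
    def __init__(self, hidden_dim=512, ffn_dim=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)  # router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.SiLU(),
                          nn.Linear(ffn_dim, hidden_dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                      # x: (num_tokens, hidden_dim)
        logits = self.gate(x)                  # (num_tokens, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the chosen 2
        out = torch.zeros_like(x)
        for k in range(self.top_k):            # combine the two experts' outputs
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

# Example: route 4 tokens through the layer
tokens = torch.randn(4, 512)
print(MoELayer()(tokens).shape)  # torch.Size([4, 512])
```

Only the two selected expert networks run for each token, which is why an MoE model can hold far more total parameters than it activates per forward pass.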

Qwen1.5-MoE Open Source! Best Practices for Inference Training

01 Introduction The Tongyi Qianwen team has launched the first MoE model in the Qwen series, Qwen1.5-MoE-A2.7B. With only 2.7 billion activated parameters, its performance rivals that of current state-of-the-art 7-billion-parameter models such as Mistral 7B and Qwen1.5-7B. Compared with Qwen1.5-7B, which contains 6.5 billion non-embedding parameters, Qwen1.5-MoE-A2.7B has … Read more
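For readers who want to try the open checkpoint directly, a minimal inference sketch with Hugging Face transformers might look like the following; the repository id Qwen/Qwen1.5-MoE-A2.7B-Chat, the prompt, and the generation settings are assumptions for illustration, not taken from the article.

```python
# Requires a recent transformers release with Qwen2-MoE support (plus accelerate
# for device_map="auto"); this is a sketch, not the article's recommended setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-MoE-A2.7B-Chat"   # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Briefly explain Mixture-of-Experts."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```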

Understanding Qwen1.5 MoE: Efficient Intelligence of Sparse Large Models

Introduction Official documentation: Qwen1.5-MoE: Achieving the Performance of 7B Models with 1/3 of the Activated Parameters | Qwen. On March 28, Alibaba open-sourced the first MoE large model in the Qwen series, Qwen1.5-MoE-A2.7B. The model is built on top of the existing Qwen-1.8B model. Qwen1.5-MoE-A2.7B has 2.7 billion activated parameters, yet it can achieve the performance … Read more