Analysis of Tongyi Qwen 2.5-Max Model

1. Qwen 2.5-Max Model Overview 1.1 Model Introduction On January 29, 2025, Alibaba Cloud officially launched Tongyi Qwen 2.5-Max, a large-scale Mixture of Experts (MoE) model that demonstrates exceptional performance and potential in natural language processing. As an important member of the Qwen series, Qwen 2.5-Max stands out in comparison … Read more

Key Details of Qwen MoE: Enhancing Model Performance Through Global Load Balancing

Today we share the latest paper from the Alibaba Cloud Tongyi Qianwen team, Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models (original paper link: https://arxiv.org/abs/2501.11873). The paper improves the training of Mixture-of-Experts (MoE) models by relaxing local (micro-batch-level) balance to global balance through lightweight communication, significantly … Read more
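
To make the core idea concrete, here is a minimal sketch (not the paper's reference implementation) of a Switch-style auxiliary load-balancing loss in which the per-expert token counts are summed across data-parallel ranks with a single lightweight all-reduce, so the balance constraint applies to the global batch rather than to each micro-batch. The function name, sizes, and the top-k choice are illustrative assumptions.

```python
# Sketch: load-balancing loss L = E * sum_i f_i * P_i, where f_i is the
# fraction of tokens routed to expert i and P_i is the mean router probability
# for expert i. The "global" variant only changes how f_i is computed: the
# per-expert token counts are summed across ranks with one small all-reduce.

import torch
import torch.distributed as dist


def load_balancing_loss(router_logits: torch.Tensor,
                        top_k: int = 2,
                        global_balance: bool = True) -> torch.Tensor:
    """router_logits: [num_tokens, num_experts] for the local micro-batch."""
    num_tokens, num_experts = router_logits.shape
    probs = router_logits.softmax(dim=-1)                 # router probabilities
    topk_idx = probs.topk(top_k, dim=-1).indices          # experts chosen per token

    # Number of tokens assigned to each expert in this micro-batch.
    counts = torch.zeros(num_experts, device=router_logits.device)
    counts.scatter_add_(0, topk_idx.reshape(-1),
                        torch.ones(topk_idx.numel(), device=router_logits.device))
    total = torch.tensor(float(num_tokens * top_k), device=router_logits.device)

    if global_balance and dist.is_available() and dist.is_initialized():
        # Lightweight communication: only num_experts counts plus one scalar.
        dist.all_reduce(counts)
        dist.all_reduce(total)

    f = counts / total          # global (or local) routing fractions
    p = probs.mean(dim=0)       # mean router probability per expert (local)
    return num_experts * torch.sum(f * p)


# Toy usage on a single process: 16 tokens, 8 experts.
logits = torch.randn(16, 8)
print(load_balancing_loss(logits).item())
```

On a single process the guard falls back to the usual micro-batch-level loss, so the sketch runs without a distributed setup.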

Qwen’s Year-End Gift: Enhancing MoE Training Efficiency

Today, we will learn about a powerful technology … Read more

Qwen Series Technical Interpretation 3 – Architecture

Shadows slant across the shallow water, and a faint fragrance drifts in the moonlight at dusk. Hello everyone, I am the little girl selling hot dry noodles, and I am delighted to share cutting-edge technology and thinking in artificial intelligence with you. Following the previous installments in this series: Qwen Series … Read more

Mamba Evolution Disrupts Transformer: A100 Achieves 140K Context

Report by New Intelligence; editor: Editorial Department. [New Intelligence Guide] The production-grade Mamba model with 52B parameters is here! This powerful variant, Jamba, has just broken the world record: it can compete directly with Transformers, offering a 256K ultra-long context window and a threefold increase in throughput, with weights available for free download. The Mamba architecture, which … Read more

Getting Started with Mistral: An Introduction

The open-source Mixtral 8x7B model launched by Mistral adopts a “Mixture of Experts” (MoE) architecture. Unlike a traditional Transformer, the MoE model incorporates multiple expert feed-forward networks (this model has 8), and during inference a gating network selects two experts to do the work, as sketched below. This setup allows MoE to … Read more
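
As a rough illustration of that routing scheme, the sketch below implements top-2 gating over eight expert feed-forward networks. The layer sizes and module names are assumptions chosen for readability, not Mistral's actual code.

```python
# A minimal, illustrative top-2 MoE layer: a gating network scores 8 expert
# feed-forward networks and, for each token, only the two highest-scoring
# experts run; their outputs are combined with renormalized gate weights.

import torch
import torch.nn as nn
import torch.nn.functional as F


class TopTwoMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts, bias=False)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: [num_tokens, d_model]
        scores = self.gate(x)                           # [tokens, experts]
        weights, idx = scores.topk(self.top_k, dim=-1)  # two experts per token
        weights = F.softmax(weights, dim=-1)            # renormalize over the two
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_pos, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if token_pos.numel():
                out[token_pos] += weights[token_pos, slot].unsqueeze(-1) * expert(x[token_pos])
        return out


tokens = torch.randn(10, 512)        # 10 token embeddings
print(TopTwoMoE()(tokens).shape)     # torch.Size([10, 512])
```

Each token only pays the compute cost of two expert FFNs even though all eight sets of weights are kept in memory, which is where the efficiency of the architecture comes from.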

Qwen1.5-MoE Open Source! Best Practices for Inference Training

01 Introduction The Tongyi Qianwen team has launched the first MoE model in the Qwen series, Qwen1.5-MoE-A2.7B. It has only 2.7 billion activated parameters, yet its performance rivals that of current state-of-the-art 7-billion-parameter models such as Mistral 7B and Qwen1.5-7B. Compared to Qwen1.5-7B, which contains 6.5 billion non-embedding parameters, Qwen1.5-MoE-A2.7B has … Read more

Understanding Qwen1.5 MoE: Efficient Intelligence of Sparse Large Models

Introduction Official documentation: Qwen1.5-MoE: Achieving the Performance of 7B Models with 1/3 Activation Parameters | Qwen. On March 28, Alibaba open-sourced its first MoE large model, Qwen1.5-MoE-A2.7B, which is built upon the existing Qwen-1.8B model. Qwen1.5-MoE-A2.7B has 2.7 billion activated parameters, yet it can achieve the performance … Read more

New Research: MoE + General Experts Solve Conflicts in Multimodal Models

Hong Kong University of Science and Technology, Southern University of Science and Technology & Huawei Noah’s Ark Lab | Reported by the WeChat official account QbitAI. Fine-tuning can make general-purpose large models better adapted to specific industry applications. However, researchers have now found that performing “multi-task instruction fine-tuning” on multimodal large models may lead to “learning more … Read more

Understanding MoE: Expert Mixture Architecture Deployment

Selected from the HuggingFace blog, translated by Zhao Yang. This article introduces the building blocks of MoE, how such models are trained, and the trade-offs to consider when using them for inference. Mixture of Experts (MoE) is a commonly used technique in LLMs aimed at improving efficiency and accuracy. The method works by breaking … Read more
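
As a back-of-the-envelope illustration of the efficiency trade-off such articles discuss, the snippet below compares the parameters an MoE layer stores with the parameters it actually activates per token. The layer sizes, expert count, and top-k are assumed values chosen only for the arithmetic, not figures from the blog post.

```python
# Rough arithmetic: an MoE layer holds many expert FFNs (large total parameter
# count) but activates only a few per token, so per-token compute resembles a
# much smaller dense model. Each expert is modeled as a simple two-matrix FFN.

d_model, d_ff = 4096, 14336           # assumed hidden and FFN sizes
num_experts, top_k = 8, 2             # assumed expert count and routing top-k

ffn_params = 2 * d_model * d_ff       # one up-projection + one down-projection
dense_total = ffn_params              # dense FFN layer
moe_total = num_experts * ffn_params  # parameters held in memory by the MoE layer
moe_active = top_k * ffn_params       # parameters actually used per token

print(f"dense FFN params per layer : {dense_total / 1e6:.0f}M")
print(f"MoE total params per layer : {moe_total / 1e6:.0f}M")
print(f"MoE active params per token: {moe_active / 1e6:.0f}M")
```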