Visual-Language (VL) Intelligence: Tasks, Representation Learning, and Large Models

Originally from AI Technology Review. Compiled by Jocelyn; edited by Chen Caixian. This article provides a comprehensive chronological survey of visual-language (VL) intelligence, dividing the field's development into three stages: the first stage, from 2014 to 2018, during which specialized models were designed for different tasks; the second stage, from … Read more

Enhancing Multi-Modal Data: MixGen from Amazon’s Li Mu Team

This article shares the paper "MixGen: A New Multi-Modal Data Augmentation". How should data augmentation be performed on multi-modal data? Amazon's Li Mu team proposes MixGen, a simple and effective method that significantly improves performance across multiple multi-modal tasks. Details are as follows: Paper link: https://arxiv.org/abs/2206.08358 Code … Read more
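A minimal sketch of the MixGen idea (linearly interpolate two images and concatenate their captions); the reverse-order batch pairing and the `lam` value here are illustrative assumptions, not necessarily the paper's exact settings:

```python
import torch

def mixgen(images: torch.Tensor, texts: list, lam: float = 0.5):
    """MixGen-style augmentation sketch: blend image pairs pixel-wise
    and concatenate the corresponding captions."""
    partner_images = torch.flip(images, dims=[0])          # pair sample i with sample B-1-i
    mixed_images = lam * images + (1.0 - lam) * partner_images
    mixed_texts = [a + " " + b for a, b in zip(texts, reversed(texts))]
    return mixed_images, mixed_texts

# Usage: augment a batch of four image-text pairs.
images = torch.rand(4, 3, 224, 224)
texts = ["a dog on grass", "a red car", "two cats", "a city at night"]
aug_images, aug_texts = mixgen(images, texts)
```

Since text has no natural notion of interpolation, the captions are kept intact and joined rather than mixed, so the semantics of both original pairs survive in the augmented sample.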

Qwen2.5-VL: Alibaba’s Latest Open Source Visual Language Model

🚀 Quick Read Model Introduction: Qwen2.5-VL is the flagship open-source visual language model from Alibaba's Tongyi Qianwen team, available in three sizes: 3B, 7B, and 72B. Main Features: supports visual understanding, long-video processing, structured output, and device operation. Technical Principles: chains a ViT visual encoder in series with the Qwen2 language model, and supports multi-modal rotary position encoding … Read more
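To make the "ViT in series with Qwen2" structure concrete, here is a toy sketch of a vision encoder whose patch features are projected into a language model's embedding space. All module names and sizes are hypothetical placeholders, not Qwen2.5-VL's actual implementation (which adds further machinery such as multi-modal rotary position encoding):

```python
import torch
import torch.nn as nn

class VLSeriesModel(nn.Module):
    """Sketch of a ViT -> projector -> LLM series structure."""
    def __init__(self, vit, llm, vit_dim: int, llm_dim: int):
        super().__init__()
        self.vit = vit                             # yields (B, N_img, vit_dim) patch features
        self.proj = nn.Linear(vit_dim, llm_dim)    # maps vision features into LLM space
        self.llm = llm                             # consumes (B, N, llm_dim) embeddings

    def forward(self, pixel_values, text_embeds):
        img_tokens = self.proj(self.vit(pixel_values))        # (B, N_img, llm_dim)
        inputs = torch.cat([img_tokens, text_embeds], dim=1)  # prepend image "tokens"
        return self.llm(inputs)

# Toy stand-ins (shapes only, not real models).
class ToyViT(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.patch = nn.Conv2d(3, dim, kernel_size=16, stride=16)
    def forward(self, x):                                     # (B, 3, H, W)
        return self.patch(x).flatten(2).transpose(1, 2)       # (B, N_img, dim)

class ToyLLM(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
    def forward(self, embeds):
        return self.block(embeds)

model = VLSeriesModel(ToyViT(64), ToyLLM(128), vit_dim=64, llm_dim=128)
out = model(torch.rand(2, 3, 224, 224), torch.rand(2, 10, 128))
```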

Embodied Intelligence and Multi-modal Language Models: Is GPT-4 Vision the Strongest Agent?

Author: PCA-EVAL Team. Affiliation: Peking University & Tencent. Abstract: Researchers from Peking University and Tencent propose PCA-EVAL, an evaluation suite for multi-modal embodied decision-making intelligence. By comparing end-to-end decision-making methods based on multi-modal models with tool-invocation methods based on LLMs, they observe that GPT-4 Vision demonstrates outstanding end-to-end decision-making capabilities from multi-modal … Read more
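The two paradigms compared in the teaser can be sketched as follows; every function and signature here is a hypothetical placeholder, not PCA-EVAL's actual code:

```python
def end_to_end_decision(image, instruction, vlm):
    """End-to-end paradigm: a multi-modal model (e.g. GPT-4 Vision)
    maps raw pixels plus the instruction directly to an action."""
    return vlm.generate(image=image, prompt=instruction)

def tool_invocation_decision(image, instruction, llm, tools):
    """Tool-invocation paradigm: vision tools first turn the image into
    text (captions, detections), then a text-only LLM decides."""
    observations = [tool(image) for tool in tools]   # e.g. captioner, detector
    context = instruction + "\nObservations:\n" + "\n".join(observations)
    return llm.generate(prompt=context)
```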

HuggingFace’s Experiments on Effective Tricks for Multimodal Models

From Xi Xiaoyao Technology Says. Original author: Xie Nian Nian. When building multimodal large models there are many effective tricks, such as using cross-attention to integrate image information into the language model, or directly concatenating the image hidden-state sequence with the text embedding sequence as input to the language model (both are sketched below). However, the reasons why these tricks … Read more
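A minimal sketch of the two fusion tricks named above, with illustrative shapes and layer sizes (real models wrap these in full transformer stacks):

```python
import torch
import torch.nn as nn

dim = 128  # shared hidden size; all names and values here are illustrative

# Trick 1: cross-attention fusion. Text tokens attend to image features
# inside the language model (Flamingo-style).
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
text_hidden = torch.rand(2, 10, dim)    # (B, N_text, dim) from the LM
image_feats = torch.rand(2, 49, dim)    # (B, N_img, dim) from a vision encoder
fused, _ = cross_attn(query=text_hidden, key=image_feats, value=image_feats)

# Trick 2: sequence concatenation. Image hidden states are projected and
# prepended to the text embeddings as ordinary input "tokens" (LLaVA-style).
proj = nn.Linear(dim, dim)
inputs_embeds = torch.cat([proj(image_feats), text_hidden], dim=1)  # (B, 59, dim)
```

Cross-attention keeps the language model's input sequence purely textual and injects vision at selected layers, while concatenation treats image features as extra tokens and lets ordinary self-attention do the fusion.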