Alibaba’s 7B Multimodal Document Understanding Model Achieves New SOTA

mPLUG Team Contribution, QbitAI (WeChat Official Account). New SOTA in multimodal document understanding! Alibaba's mPLUG team has released its latest open-source work, mPLUG-DocOwl 1.5, proposing a series of solutions to tackle four major challenges: high-resolution image text recognition, general document structure understanding, instruction following, and external knowledge incorporation. Without further ado, let's take a … Read more

MM-Interleaved: The Ultimate Open-Source Multimodal Generation Model

Machine Heart Column, by the Machine Heart Editorial Team. In the past few months, with the successive releases of major works such as GPT-4V, DALL-E 3, and Gemini, multimodal generative large models, widely seen as “the next step toward AGI,” have rapidly become a focus for researchers worldwide. Imagine: AI that not only chats but also has “eyes” that can understand images, and … Read more

Handling Noisy Imbalanced Multimodal Data: A Review

Multimodal fusion aims to integrate information from various modalities to achieve more accurate predictions. Significant progress has been made in multimodal fusion across a wide range of scenarios including autonomous driving and medical diagnosis. However, the reliability of multimodal fusion in low-quality data environments remains largely unexplored. This paper reviews the common challenges and recent … Read more

Multimodal Opportunities in the Post-GPT Era

Author: Wang Yonggang, Founder/CEO of SeedV Lab and Executive Dean of the AI Academy at Innovation Works. The advent of ChatGPT/GPT-4 has completely transformed the research landscape of the NLP field and, with its multimodal potential, ignited the first spark toward AGI. Thus, the era of AI 2.0 has arrived. But where will the technological train of … Read more

Ant Group’s Technical Exploration in Video Multimodal Retrieval

Introduction: This article shares the research achievements of Ant Group’s multimodal cognitive team in the field of video multimodal retrieval over the past year. The article focuses on how to improve the effectiveness of video-text semantic retrieval and how to efficiently perform video-source retrieval. Main sections include: 1. Overview; 2. Video-Text Semantic Retrieval; 3. Video-Video … Read more

HuggingGPT: From Multimodal to AGI

Source: Machine Heart. ChatGPT has become the manager of hundreds of models. In recent months, the successive popularity of ChatGPT and GPT-4 has showcased the extraordinary capabilities of large language models (LLMs) in language understanding, generation, interaction, and reasoning, attracting significant attention from both academia and industry and revealing the potential of LLMs in … Read more

Building Powerful Multimodal Search with Voyager-3 and LangGraph

Embedding images and text in the same space allows us to perform high-precision searches on multimodal content such as web pages, PDFs, magazines, books, brochures, and various papers. Why is this technology so interesting? The most exciting aspect of embedding text and images in the same space is that you can search for and retrieve … Read more
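
The core idea behind this kind of search can be sketched in a few lines. The snippet below is a minimal illustration rather than the article's actual pipeline: it uses the open-source clip-ViT-B-32 model from sentence-transformers as a stand-in for the Voyager-3 embeddings and LangGraph orchestration the article covers, and the file name and example passages are placeholders.

```python
# Minimal sketch of same-space image/text embedding search.
# Assumptions: clip-ViT-B-32 (an open-source CLIP model) stands in for the
# Voyager-3 embeddings; "page_scan.png" and the passages are placeholders.
import torch
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

# Embed an image (e.g. a rendered PDF or magazine page) and some text
# passages into the same vector space.
image_embs = model.encode([Image.open("page_scan.png")], convert_to_tensor=True)
text_embs = model.encode(
    [
        "Quarterly revenue grew 12% year over year.",
        "Figure 3 shows the overall model architecture.",
    ],
    convert_to_tensor=True,
)
corpus_embs = torch.vstack([image_embs, text_embs])

# A text query lands in the same space; cosine similarity ranks images and
# passages together, so a single query can retrieve either kind of content.
query_emb = model.encode("chart about revenue growth", convert_to_tensor=True)
scores = util.cos_sim(query_emb, corpus_embs)[0]
print("best match:", int(scores.argmax()), "score:", float(scores.max()))
```

In a full system the same pattern scales up by storing the vectors in an index (FAISS or a vector database) and letting an agent framework decide when to issue the retrieval call.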

Experience Local Deployment of VisualGLM-6B Multimodal Dialogue Model

The VisualGLM-6B multimodal dialogue model used in this article is open-sourced by Zhipu AI and the KEG Lab of Tsinghua University. It can describe images and answer related knowledge questions. This article walks you through a local deployment so you can experience the model's practical capabilities firsthand. 1. Environment … Read more
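
As a preview of what the local inference step looks like, here is a minimal sketch following the usage pattern published on the THUDM/visualglm-6b model card. It assumes a CUDA GPU with enough memory for the fp16 weights (quantized variants exist for smaller cards), and the image path is a placeholder.

```python
# Minimal local-inference sketch for VisualGLM-6B, following the pattern on
# the THUDM/visualglm-6b model card. Assumes a CUDA GPU with enough memory
# for fp16 weights; "your_image.png" is a placeholder.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True).half().cuda()
model = model.eval()

# Ask the model to describe a local image, then follow up using the history.
image_path = "your_image.png"
response, history = model.chat(tokenizer, image_path, "Describe this image.", history=[])
print(response)
response, history = model.chat(tokenizer, image_path, "What objects stand out?", history=history)
print(response)
```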

Experience the New Version of Zhipu GLM-PC: Upgrading Multimodal Agents for Autonomous Computer Operation

Introduction to GLM-PC: GLM-PC, based on Zhipu's leading multimodal large model CogAgent, is the world's first plug-and-play computer intelligent agent available to the public. It possesses human-like computer “observation” and “operation” capabilities, assisting users in efficiently handling various computer tasks. Since the release of GLM-PC v1.0 on November 29, 2024, and the commencement of internal … Read more

Zhipu GLM-PC Open for Public Experience: Upgraded Multimodal Agent

GLM-PC, based on the multimodal large model CogAgent, is the world's first publicly available, ready-to-use computer agent. It can “observe” and “operate” a computer like a human, assisting users in efficiently completing various computer tasks. Since the release of GLM-PC v1.0 on November 29, 2024, and the opening of its internal testing, … Read more