Gemini 2.0: A New AI Model for the Era of Intelligent Agents

Overview: In an era of rapid information iteration, artificial intelligence (AI) is changing our lives at an astonishing pace. From search engines to multimodal technologies, AI's reach continues to extend, pushing the boundaries of human technology. As a pioneer in the AI field, Google DeepMind recently released its latest AI model, Gemini 2.0, heralding the … Read more

Phidata Multimodal Multi-Agent Framework Overview

This open-source agent series introduces the open-source agent frameworks currently available on the market, such as CrewAI, AutoGen, LangChain, phidata, and Swarm, discussing their strengths, weaknesses, features, results, and usage. Interested readers can follow the official account "XiaozhiAGI" for continuous updates on cutting-edge AI technologies and products, such as RAG, agents, agentic workflows, and AGI. … Read more

VideoLLaMA3: Advanced Multimodal Foundation Model

Paper: https://arxiv.org/abs/2412.09262 Code: https://github.com/DAMO-NLP-SG/VideoLLaMA3 Introduction: VideoLLaMA3 is a more advanced multimodal foundation model for image and video understanding. Its core design philosophy is vision-centric, reflected in both a vision-centric training paradigm and a vision-centric framework design. The key point of the vision-centric training paradigm is that high-quality image-text data is crucial for understanding both … Read more

WindSurf Update Testing & Open Source Multimodal AI Creation App

Hello everyone, I'm Kate. Do you remember the English version of the AI creation app I shared yesterday? A user asked whether there is a Chinese voice version. Now it's finally here, and this time it's still open source! In this video, I will take you on a deep dive into the … Read more

Performance of 2B Parameters Surpasses Mistral-7B: Wall Intelligence Multimodal Edge Model Open Source

Machine Heart reports (Editor: Zenan). Low-cost devices can run it locally. As large models continue to evolve toward larger scales, recent progress has also been made in optimization and deployment. On February 1, Wall Intelligence, in collaboration with the Tsinghua NLP Laboratory, officially launched its flagship edge large model "Wall MiniCPM" in Beijing. The new generation large … Read more

Alibaba’s 7B Multimodal Document Understanding Model Achieves New SOTA

Contributed by the mPLUG team, via the QbitAI WeChat official account. A new SOTA in multimodal document understanding! Alibaba's mPLUG team has released its latest open-source work, mPLUG-DocOwl 1.5, proposing a series of solutions to four major challenges: high-resolution image text recognition, general document structure understanding, instruction following, and external knowledge incorporation. Without further ado, let's take a … Read more

MM-Interleaved: The Ultimate Open-Source Multimodal Generation Model

A Machine Heart column by the Machine Heart editorial team. In the past few months, with the successive releases of major works such as GPT-4V, DALL-E 3, and Gemini, multimodal generative large models, widely seen as "the next step toward AGI," have rapidly become the focus of scholars worldwide. Imagine: an AI that not only chats but also has "eyes" that can understand images, and … Read more

Handling Noisy Imbalanced Multimodal Data: A Review

Multimodal fusion aims to integrate information from various modalities to achieve more accurate predictions. Significant progress has been made in multimodal fusion across a wide range of scenarios including autonomous driving and medical diagnosis. However, the reliability of multimodal fusion in low-quality data environments remains largely unexplored. This paper reviews the common challenges and recent … Read more
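The idea of integrating predictions from several modalities can be sketched with a simple weighted late-fusion scheme. This is an illustration only, not the method of any paper surveyed above; the function name, the toy probabilities, and the reliability weights are all assumptions chosen for the example. Down-weighting a noisy modality is one minimal way to make fusion more robust to low-quality inputs.

```python
import numpy as np

def late_fusion(probs_by_modality, weights=None):
    """Combine per-modality class probabilities by weighted averaging.

    probs_by_modality: list of 1-D arrays, each a probability
    distribution over the same classes (one per modality).
    weights: optional per-modality reliability weights; uniform if None.
    """
    probs = np.stack(probs_by_modality)          # (n_modalities, n_classes)
    if weights is None:
        weights = np.ones(len(probs_by_modality))
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                              # normalize weights
    fused = (w[:, None] * probs).sum(axis=0)     # weighted average
    return fused / fused.sum()                   # renormalize

# A noisy audio modality is down-weighted relative to a cleaner
# vision modality, so the fused prediction follows vision.
vision = np.array([0.7, 0.2, 0.1])
audio  = np.array([0.3, 0.4, 0.3])
fused  = late_fusion([vision, audio], weights=[0.8, 0.2])
```

With weights 0.8/0.2 the fused distribution is 0.8·vision + 0.2·audio, so the top class stays the one vision is confident about.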

Multimodal Opportunities in the Post-GPT Era

Author: Wang Yonggang, Founder/CEO of SeedV Lab, Executive Dean of AI Academy at Innovation Works The advent of ChatGPT/GPT-4 has completely transformed the research landscape in the NLP field and ignited the first spark towards AGI with its multimodal potential. Thus, the era of AI 2.0 has arrived. But where will the technological train of … Read more

Ant Group’s Technical Exploration in Video Multimodal Retrieval

Introduction: This article shares the research achievements of Ant Group's multimodal cognition team in video multimodal retrieval over the past year, focusing on how to improve the effectiveness of video-text semantic retrieval and how to perform video-source retrieval efficiently. Main sections include: 1. Overview; 2. Video-Text Semantic Retrieval; 3. Video-Video … Read more
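The core of video-text semantic retrieval is ranking videos by the similarity between a text query embedding and per-video embeddings. The sketch below is a generic dual-encoder retrieval step, not Ant Group's actual system; the function name and the toy embeddings are assumptions, standing in for the outputs of trained text and video encoders (e.g. CLIP-style towers).

```python
import numpy as np

def retrieve(text_emb, video_embs, k=2):
    """Rank videos by cosine similarity to a text query embedding.

    text_emb: 1-D query embedding; video_embs: (n_videos, dim) matrix.
    Returns the indices of the top-k most similar videos, best first.
    """
    t = text_emb / np.linalg.norm(text_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = v @ t                      # cosine similarity per video
    return np.argsort(-sims)[:k]     # indices of the k best matches

# Toy example: video 1 points nearly the same direction as the query.
query  = np.array([1.0, 0.0])
videos = np.array([[0.0, 1.0],    # orthogonal to the query
                   [0.9, 0.1],    # closest to the query
                   [0.5, 0.5]])   # in between
ranked = retrieve(query, videos, k=2)
```

At production scale the same dot-product ranking is usually delegated to an approximate nearest-neighbor index rather than computed exhaustively.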