DeepSeek-VL: A Preliminary Exploration of Multimodal Models

Following the release of its large models for language, code, and mathematics, DeepSeek brings another early milestone on the journey toward AGI: DeepSeek-VL. By jointly scaling training data, model architecture, and training strategy, it aims to build the strongest open-source 7B and 1.3B multimodal models.


Highlights
  • Data: Multi-source multimodal data enhances the model’s general cross-modal capabilities, while a large proportion of pure text data is mixed in so that its language capabilities do not degrade.

  • Architecture: Employs a dual-visual encoder structure that is sensitive to both low-level visual signals and high-level semantic information.

  • Training: Utilizes a three-phase training method, first aligning visual and language spaces, then enhancing the model’s general cross-modal understanding through pre-training, and finally aligning with human preferences using specific task data.

  • Experiments: Surpasses models of the same scale (7B parameters) such as EMU2-Chat/Yi-VL, and even exceeds larger scale models (17B parameters) like CogVLM.

Both the model and the paper have been open-sourced.

Paper link: https://arxiv.org/abs/2403.05525

Model download: https://huggingface.co/deepseek-ai

GitHub homepage: https://github.com/deepseek-ai/DeepSeek-VL


Model Advantages

DeepSeek-VL integrates multimodal capabilities without sacrificing language ability, providing detailed and well-organized responses to questions from most real-world scenarios. It accepts high-resolution images as input (up to 1024×1024) and can recognize small objects within them. It also has general multimodal understanding capabilities, handling logic diagrams, web pages, formula recognition, scientific literature, and natural images, and showing intelligence in complex scenarios.
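
To get a feel for the model yourself, the sketch below shows roughly how the released checkpoint can be queried through the code in the GitHub repository. The class and helper names (VLChatProcessor, load_pil_images, the <image_placeholder> token) follow the repository's quickstart as we understand it, and the image path is a hypothetical example, so treat the details as assumptions to verify against the current README.

```python
# Rough usage sketch based on the DeepSeek-VL GitHub quickstart.
# Class/helper names and the image path are assumptions; check the README for the exact API.
import torch
from transformers import AutoModelForCausalLM
from deepseek_vl.models import VLChatProcessor
from deepseek_vl.utils.io import load_pil_images

model_path = "deepseek-ai/deepseek-vl-7b-chat"
processor = VLChatProcessor.from_pretrained(model_path)
tokenizer = processor.tokenizer

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
model = model.to(torch.bfloat16).cuda().eval()

# A single user turn with one image (up to 1024x1024) and a question.
conversation = [
    {
        "role": "User",
        "content": "<image_placeholder>Describe the chart in this image.",
        "images": ["./chart.png"],  # hypothetical local file
    },
    {"role": "Assistant", "content": ""},
]

# Load the image(s), build model inputs, and fuse visual features into the LLM inputs.
pil_images = load_pil_images(conversation)
inputs = processor(conversations=conversation, images=pil_images, force_batchify=True).to(model.device)
inputs_embeds = model.prepare_inputs_embeds(**inputs)

# Generate the answer with the underlying language model.
outputs = model.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
)
print(tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True))
```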

What is the actual experience like? Let’s look at some examples.

[Figures: example conversations with DeepSeek-VL]

As we can see, DeepSeek-VL not only possesses strong image-text understanding capabilities but also generates highly organized responses. The powerful capabilities of DeepSeek-VL stem from the researchers’ comprehensive considerations in data, model structure, and training strategies.
Moreover, DeepSeek-VL also performs impressively on public benchmark leaderboards, surpassing models of the same scale (7B parameters) such as EMU2-Chat and Yi-VL, and even exceeding larger models with a combined vision+LLM parameter count of 17B (e.g., CogVLM). At the 1.3B scale, it even outperforms the current 2.7B model MobileVLM V2.

[Figures: public benchmark results]

Additionally, we used GPT-4V to compare DeepSeek-VL against other models on the 99 test samples used for human assessment. In most cases, GPT-4V judged DeepSeek-VL’s responses to be of higher quality. As shown in the figure below, DeepSeek-VL was rated superior in over 60% of cases against open-source multimodal models including Fuyu-8B, CogVLM, and InternLM, and it also performed strongly against proprietary models such as GPT-4V itself and Qwen.

[Figure: GPT-4V pairwise comparison results]
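
For readers who want to run a similar pairwise comparison themselves, here is a minimal sketch of the general idea, assuming access to a judge model behind a judge_fn callable. The prompt wording and verdict format are our own illustrative assumptions, not the exact protocol used in the paper.

```python
# Minimal sketch of pairwise response judging of the kind described above.
# `judge_fn`, the prompt wording, and the verdict format are illustrative assumptions.
from typing import Callable

def compare_responses(question: str, answer_a: str, answer_b: str,
                      judge_fn: Callable[[str], str]) -> str:
    """Ask a judge model which of two answers to the same question is better."""
    prompt = (
        "You are comparing two answers to the same question.\n"
        f"Question: {question}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}\n\n"
        "Reply with exactly one word: A, B, or Tie."
    )
    verdict = judge_fn(prompt).strip().upper()  # judge_fn sends the prompt to, e.g., GPT-4V
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"
```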

Data – Diverse and Scalable

We are committed to ensuring our data is both diverse and scalable. It is drawn from sources such as Common Crawl, web code, e-books, educational materials, and arXiv articles, and broadly covers real-world scenarios including web page screenshots, PDF files, OCR datasets, charts, and knowledge-based content (expert knowledge, textbooks), aiming to reflect actual usage scenarios as fully as possible.
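
As an illustration of what such a mixture can look like in practice, the configuration below lists source categories named in this section with made-up placeholder weights; it is not the actual DeepSeek-VL data recipe.

```python
# Illustrative pre-training data mixture. Category names mirror the text above;
# the weights are placeholder values, NOT the ratios actually used in DeepSeek-VL.
DATA_MIXTURE = [
    {"source": "interleaved_web_pages",  "weight": 0.25},  # Common Crawl, web screenshots
    {"source": "web_code",               "weight": 0.10},
    {"source": "ebooks_and_textbooks",   "weight": 0.10},
    {"source": "arxiv_articles",         "weight": 0.05},
    {"source": "pdf_and_ocr",            "weight": 0.10},
    {"source": "charts_and_tables",      "weight": 0.10},
    {"source": "pure_text",              "weight": 0.30},  # keeps language ability intact
]
assert abs(sum(entry["weight"] for entry in DATA_MIXTURE) - 1.0) < 1e-9
```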

Architecture – Combining Language Understanding and Fine-grained Recognition


Considering efficiency and the needs of most real-world scenarios, DeepSeek-VL integrates a hybrid visual encoder that handles high-resolution images (1024×1024) while keeping computational overhead relatively low. This design ensures the model captures both key semantics and fine details across a variety of visual tasks.
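
As a rough illustration of what such a dual-branch design can look like, here is a conceptual PyTorch sketch: one branch encodes a downsampled view for high-level semantics, the other encodes the full 1024×1024 image for low-level detail, and a small projector maps the fused features into the LLM's embedding space. The specific backbones, feature shapes, and fusion scheme are illustrative assumptions, not the exact DeepSeek-VL implementation.

```python
# Conceptual sketch of a dual/hybrid visual encoder; details are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridVisionEncoder(nn.Module):
    """Dual-branch visual encoder: semantics at low resolution, detail at 1024x1024."""

    def __init__(self, semantic_encoder: nn.Module, detail_encoder: nn.Module,
                 semantic_dim: int, detail_dim: int, llm_dim: int):
        super().__init__()
        self.semantic_encoder = semantic_encoder  # e.g., a ViT run on a 384x384 view
        self.detail_encoder = detail_encoder      # e.g., a high-res encoder run at 1024x1024
        # Small MLP adaptor projecting fused visual features into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(semantic_dim + detail_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_1024: torch.Tensor) -> torch.Tensor:
        # Downsample a copy for the semantic branch; keep full resolution for detail.
        image_384 = F.interpolate(image_1024, size=(384, 384),
                                  mode="bilinear", align_corners=False)
        sem = self.semantic_encoder(image_384)   # (B, N, semantic_dim)
        det = self.detail_encoder(image_1024)    # (B, N, detail_dim), same token count assumed
        fused = torch.cat([sem, det], dim=-1)    # (B, N, semantic_dim + detail_dim)
        return self.projector(fused)             # (B, N, llm_dim): visual tokens fed to the LLM
```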

Training – Language and Images, Both Are Essential
DeepSeek-VL departs from the traditional approach of fine-tuning a large language model (LLM) directly on multimodal input. Instead, we advocate pre-training the LLM on integrated visual and language data, using a training strategy that leans toward language while still endowing the model with strong multimodal understanding. We find that keeping a high language proportion (up to 70%) maintains language proficiency while still achieving strong multimodal understanding. This strategy aims to develop deeper, shared representations across both modalities, enhancing the model’s overall capabilities.
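
A minimal sketch of what such modality-ratio sampling can look like, assuming the 70% figure above and a simple per-batch coin flip (the actual data loader is more involved):

```python
# Minimal sketch of modality-ratio sampling during joint pre-training.
# The 70/30 split follows the text above; the sampling scheme itself is an
# illustrative assumption, not DeepSeek-VL's exact data pipeline.
import random

LANGUAGE_RATIO = 0.7  # share of pure-text batches, per the "up to 70%" figure above

def next_batch(text_batches, multimodal_batches, rng=random):
    """Draw the next training batch, keeping roughly a 70/30 text/multimodal mix."""
    if rng.random() < LANGUAGE_RATIO:
        return next(text_batches)        # pure-text batch: preserves language ability
    return next(multimodal_batches)      # image-text batch: builds cross-modal understanding
```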


About DeepSeek

DeepSeek (深度求索) is dedicated to exploring the essence of AGI, gathering more creativity and productivity through open source.

In the future, we will continue to release larger-scale models, innovative frameworks, and models with stronger complex reasoning capabilities!


—end—

