Hugging Face’s Experiments on Effective Tricks for Multimodal Large Models


The MLNLP community is a well-known machine learning and natural language processing community in China and abroad, whose members include NLP master's and doctoral students, university faculty, and industry researchers.
The community's vision is to promote communication and progress between academia and industry in natural language processing and machine learning, in China and abroad, with a particular focus on helping beginners.
Source | Xixiaoyao Technology
Author | Xie Nian Nian
There are many tricks that appear effective when building multimodal large models, such as using cross-attention to inject image information into the language model, or directly concatenating the sequence of image hidden states with the text embedding sequence as input to the language model.
However, why these tricks work, and what they cost computationally, is often explained only roughly in the literature or lacks sufficient experimental validation.
The Hugging Face team recently ran extensive experiments to verify which tricks are truly effective when building multimodal large models, drawing a series of valuable conclusions, some of which even overturn commonly held views in previous literature.
Based on these validated tricks, the team open-sourced Idefics2, an 8B-parameter vision-language model that is the strongest in its size class, outperforming models four times its size on some benchmarks and rivaling the closed-source Gemini 1.5 Pro.
In addition, a dialogue-tuned version of Idefics2 was trained, and it performs quite well in interactions with users.
For example, analyzing data in the table and performing correct calculations:

[Screenshot: Idefics2 analyzing a table and computing values]

Finding the required information in the resume and organizing it in JSON format:
[Screenshot: extracting the requested resume information as JSON]
Interpreting memes also looks quite good:
[Screenshot: Idefics2 interpreting a meme]
This meme depicts a young girl in a yellow raincoat who seems to be crossing a grassy field. She is holding a yellow object, possibly a toy or a device. The background of the photo is a green field with trees in the distance. The text on the meme reads “I finished work a day before the holiday”. This indicates that the girl is excitedly leaving work early, symbolizing her joyful run in the field before the holiday starts. The girl’s energetic posture combined with the theme of “work” creates a relaxed and relatable scene for viewers who might also be looking forward to the holiday.
The team has also open-sourced the code and released an online demo, so interested readers can try it out.
Trial address: https://huggingface.co/spaces/HuggingFaceM4/idefics2_playground
Paper title: What matters when building vision-language models?
Paper link: https://arxiv.org/pdf/2405.02246
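For readers who prefer to run the released model locally rather than use the hosted demo, the snippet below is a minimal sketch of loading Idefics2 with the transformers library, based on the public model card for HuggingFaceM4/idefics2-8b; the image URL is only a placeholder, and exact argument names may differ across library versions (device placement and quantization are omitted here).

```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")

# Placeholder URL; substitute any image you want the model to describe.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "What does this chart show?"}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```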

Basic Structure of Multimodal Large Models

First, let’s briefly understand which parts make up a multimodal large model.
Generally, a multimodal large model can be divided into two parts, multimodal understanding and multimodal generation, as shown in the figure below. Multimodal understanding consists of the modality encoders, the input projection, and the large language model backbone, while multimodal generation consists of the output projection and the modality generators. During training, the parameters of the modality encoders, the generators, and the language model are typically kept frozen; optimization focuses mainly on the input and output projections.
[Figure: general architecture of a multimodal large model]

▲ Source: “MM-LLMs: Recent Advances in MultiModal Large Language Models”
This article focuses on multimodal understanding, and therefore on the modality encoder and input projection.
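To make the division between frozen and trainable parts concrete, here is a minimal PyTorch-style sketch; the class and attribute names are illustrative, not the actual Idefics2 implementation.

```python
import torch.nn as nn

class ToyMultimodalModel(nn.Module):
    """Illustrative component layout: frozen unimodal backbones + a trainable connector."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int, text_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder                      # pre-trained, frozen
        self.language_model = language_model                      # pre-trained, frozen
        self.input_projection = nn.Linear(vision_dim, text_dim)   # newly initialized, trainable

        for backbone in (self.vision_encoder, self.language_model):
            for p in backbone.parameters():
                p.requires_grad = False

    def trainable_parameters(self):
        # Only the connector is handed to the optimizer.
        return self.input_projection.parameters()
```

How the projected image features actually enter the language model, via cross-attention or by concatenation with the text sequence, is exactly the design choice examined below.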

Are Common Tricks Used in Building Multimodal Large Models Really Effective?

The Impact of Modal Encoders on Performance

Multimodal large models use a pre-trained modality encoder to extract features from visual inputs and a language-model backbone to process text. What impact does the choice of vision and language model have on final performance?
The authors fixed the size of the pre-trained modules, the multimodal pre-training data, and the number of training updates. Under the cross-attention architecture, performance on vision-language benchmarks improved significantly as better backbones were swapped in.
As shown in Table 1, replacing the LLaMA-1-7B language model with Mistral-7B improved performance by 5.1 percentage points.

[Table 1: impact of the language model backbone]

Additionally, switching the visual encoder from CLIP-ViT-H to SigLIP-SO400M improved performance by 3.3 percentage points in benchmark tests, as shown in Table 2:
[Table 2: impact of the vision encoder]
Conclusion: for a fixed parameter count, the quality of the language model backbone has a greater impact on final VLM performance than that of the vision backbone.
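As a point of reference, the best-performing backbone pair from these ablations can be loaded roughly as follows; this is a sketch assuming a recent transformers version that includes SigLIP, and the checkpoint names are the public Hub IDs.

```python
from transformers import AutoModel, AutoModelForCausalLM

# Public Hub checkpoints for the best-performing pair in the ablations above.
vision_encoder = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
language_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
```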

Which is Better: Fully Autoregressive Architecture or Cross-Attention Architecture?

The purpose of input projection is to connect the pre-trained visual and language modules, aligning visual and text inputs. There are two mainstream methods:
  1. Cross-attention architecture: the image is encoded by the vision module, and the resulting image features are injected into different layers of the language model through cross-attention blocks, where the text hidden states attend to the image.
  2. Fully autoregressive architecture: the output of the vision encoder is projected and concatenated directly with the text embeddings, and the whole sequence is fed to the language model; the visual sequence can be pooled to improve computational efficiency. (Both options are sketched below.)
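The following sketch contrasts the two options in simplified PyTorch; it is illustrative only and assumes a transformers-style language model that exposes get_input_embeddings().

```python
import torch
import torch.nn as nn

def fully_autoregressive_inputs(image_hidden, input_ids, projection, language_model):
    """Option 2: project the image hidden states, concatenate them with the text
    embeddings, and feed the whole sequence to the language model."""
    image_embeds = projection(image_hidden)                          # (B, N_img, d_model)
    text_embeds = language_model.get_input_embeddings()(input_ids)   # (B, N_txt, d_model)
    return torch.cat([image_embeds, text_embeds], dim=1)             # one long sequence

class CrossAttentionBlock(nn.Module):
    """Option 1: a block interleaved between language model layers; the text hidden
    states attend to the image features instead of image tokens being part of the
    input sequence."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_hidden, image_embeds):
        attended, _ = self.attn(query=text_hidden, key=image_embeds, value=image_embeds)
        return self.norm(text_hidden + attended)
```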
To compare the two architectures, the authors first froze the unimodal modules and trained only the newly initialized parameters (cross-attention blocks on one side, the modality projection and pooling on the other), using a fixed amount of training data. Interleaving cross-attention blocks more frequently between the language model layers improves vision-language performance.
Under this setup, the cross-attention architecture has 1.3 billion more trainable parameters (2 billion trainable in total), and its inference compute increases by about 10%. Under these conditions, the cross-attention architecture outperformed the fully autoregressive architecture by 7 percentage points, as shown in the second and third rows of the table below.
[Table: fully autoregressive vs. cross-attention architectures, with frozen and unfrozen backbones]
The newly initialized trainable parameters account for only about 15% of the total parameters in the fully autoregressive architecture and about 25% in the cross-attention architecture, which may limit what training can express. The authors therefore unfroze all parameters (both the newly initialized ones and the pre-trained unimodal modules) and compared the two architectures again. To keep the training loss of the fully autoregressive architecture from diverging, LoRA was used to adapt the pre-trained parameters while the newly initialized parameters were fully fine-tuned; the results are shown in the last two rows of the table above.
This significantly improved training stability: performance of the fully autoregressive architecture increased by 12.9 percentage points, while the cross-attention architecture gained only 0.6 points. With more trainable parameters, the fully autoregressive architecture is therefore the more cost-effective choice.
Conclusion 1: When the unimodal pre-trained modules are frozen, the cross-attention architecture performs better than the fully autoregressive one. However, once the unimodal backbones are unfrozen and trained, the fully autoregressive architecture performs better, even though the cross-attention architecture has more parameters.
Conclusion 2: Under the fully autoregressive architecture, directly unfreezing the pre-trained modules can destabilize training; using LoRA adds expressive capacity while keeping training stable.
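A minimal sketch of the stabilization trick from Conclusion 2, using the peft library: the pre-trained language model is adapted with LoRA while the newly initialized connector is trained in full. The target module names below are typical attention projections and would need to match the backbone's actual layer names; the connector dimensions are examples (SigLIP hidden size to Mistral hidden size).

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

language_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
connector = nn.Linear(1152, 4096)    # e.g. SigLIP hidden size -> Mistral hidden size

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # typical attention projections
    lora_dropout=0.05,
)
language_model = get_peft_model(language_model, lora_config)

# Only the LoRA adapters on the backbone plus the full connector receive gradients.
trainable = [p for p in language_model.parameters() if p.requires_grad]
trainable += list(connector.parameters())
```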

More Image Tokens, Stronger Performance?

Previous studies typically passed all hidden states of the vision encoder to the modality projection and then into the language model without pooling, which results in a very large number of tokens per image and raises training cost. Studies [2,3] report that increasing the number of visual tokens improves performance, but the authors found no further improvement beyond 64 visual tokens. They speculate that with unlimited training data and compute, more tokens might still help, but at a cost that is unacceptable in practice.
To address this, the authors introduced a trainable Transformer-based pooler (a perceiver-style resampler) to shorten the sequence of hidden states for each image. This not only reduces the number of tokens but also improves performance: as shown in the table below, it improves performance by an average of 8.5 points over the no-pooling baseline while reducing the number of tokens per image from 729 to 64.

[Table: effect of pooling the visual tokens]

Conclusion: Using a trainable pooler to reduce the number of visual tokens significantly improves computational efficiency during training and inference, while enhancing performance on downstream tasks.
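Below is a minimal sketch of such a learned pooler in the spirit of a perceiver resampler: a small set of learned latent queries cross-attends to the image hidden states, shrinking, say, 729 visual tokens down to 64. It is a single-layer simplification, not the exact module used in Idefics2.

```python
import torch
import torch.nn as nn

class VisualTokenPooler(nn.Module):
    """A fixed set of learned latent queries cross-attends to the image hidden
    states, compressing them to `num_latents` tokens (single layer, simplified)."""

    def __init__(self, d_model: int, num_latents: int = 64, n_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, image_hidden: torch.Tensor) -> torch.Tensor:
        # image_hidden: (B, 729, d_model) for a 27x27 patch grid, for example
        queries = self.latents.unsqueeze(0).expand(image_hidden.size(0), -1, -1)
        pooled, _ = self.attn(query=queries, key=image_hidden, value=image_hidden)
        return self.norm(pooled)                          # (B, num_latents, d_model)

pooler = VisualTokenPooler(d_model=1152)                  # SigLIP-SO400M hidden size
print(pooler(torch.randn(2, 729, 1152)).shape)            # torch.Size([2, 64, 1152])
```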

Does Fixing Image Aspect Ratio and Resolution Affect Performance?

Visual encoders (such as SigLIP) are typically trained on fixed-size square images. Adjusting the image size alters its original aspect ratio, which can be problematic in certain tasks (such as reading long texts). Additionally, training only at a single resolution has limitations: low resolution may overlook critical visual details, while high resolution reduces training and inference efficiency. Allowing the model to handle images of different resolutions enables users to flexibly adjust computational resources as needed.
The authors instead fed images into the vision encoder without resizing them or altering their aspect ratio. Starting from an encoder pre-trained on fixed-size, low-resolution square images, the pre-trained positional embeddings were interpolated to the new resolutions, and the vision encoder was adapted with LoRA parameters. The results are shown in the table below:

[Table: effect of preserving aspect ratio and resolution]

It can be seen that preserving the aspect ratio (AR preserving) maintains task performance while freeing up computational flexibility. Furthermore, there is no need to upscale every image to high resolution, which saves GPU memory and allows images to be processed at the resolution they need.
Conclusion: Adapting a vision encoder pre-trained on fixed-size square images so that it preserves the original aspect ratio and resolution speeds up training and inference, reduces memory consumption, and does not hurt performance.
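The variable-resolution trick relies on resizing the encoder's learned position embeddings rather than the images. The sketch below shows the common bicubic interpolation approach for a ViT-style square grid of patch position embeddings; the actual SigLIP adaptation may differ in detail.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_h: int, new_w: int) -> torch.Tensor:
    """pos_embed: (1, N, d) learned for a square grid of sqrt(N) x sqrt(N) patches;
    returns position embeddings for a new_h x new_w patch grid."""
    n, d = pos_embed.shape[1], pos_embed.shape[2]
    grid = int(n ** 0.5)
    pe = pos_embed.reshape(1, grid, grid, d).permute(0, 3, 1, 2)          # (1, d, g, g)
    pe = F.interpolate(pe, size=(new_h, new_w), mode="bicubic", align_corners=False)
    return pe.permute(0, 2, 3, 1).reshape(1, new_h * new_w, d)            # (1, new_h*new_w, d)

# Example: a 27x27 grid (729 positions) stretched to a 36x20 grid for a wide image.
new_pe = interpolate_pos_embed(torch.randn(1, 729, 1152), new_h=36, new_w=20)
```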

What Impact Does Splitting into Sub-images for Training Have on Performance?

Several papers have shown that splitting an image into sub-images and concatenating them with the original image can improve downstream performance, at the cost of a much larger number of image tokens to encode.
During the instruction fine-tuning phase, the authors expanded each image into a list containing the original image plus four crops. The model can then process either a single image (64 visual tokens) or the augmented set (320 visual tokens in total) at inference time, as shown in the table below:

[Table: effect of splitting images into sub-images]

This strategy is particularly effective for benchmarks like TextVQA and DocVQA, as they require high resolution to extract text from images. Even splitting only 50% of the training images did not affect performance.
Conclusion: Splitting images into sub-images during training makes it possible to trade additional compute for better performance at inference time; the gain is especially clear on tasks that involve reading text in images.
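A simple sketch of the augmentation described above, using PIL: each image is expanded into the original plus four crops (here simply the four quadrants, which may differ from the exact cropping scheme used in the paper), so that with a 64-token pooler one image yields 5 × 64 = 320 visual tokens.

```python
from PIL import Image

def split_into_subimages(image: Image.Image) -> list:
    """Return the original image plus its four quadrant crops."""
    w, h = image.size
    boxes = [(0, 0, w // 2, h // 2), (w // 2, 0, w, h // 2),
             (0, h // 2, w // 2, h), (w // 2, h // 2, w, h)]
    return [image] + [image.crop(box) for box in boxes]

# With a 64-token pooler per image, the augmented list costs 5 * 64 = 320 visual tokens.
```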

Building Idefics2—An Open Advanced Visual Language Foundation Model

After examining the factors that affect the performance of vision-language models, the authors trained Idefics2, an open 8B-parameter vision-language model. The sections below describe the model construction, dataset selection, and training stages.

1. Multi-stage Pre-training

The authors started from SigLIP-SO400M and Mistral-7B-v0.1 and pre-trained Idefics2 on three types of data.

Interleaved Image-Text Documents

The data source is the filtered and cleaned OBELICS dataset, an open interleaved image-text document dataset containing 350 million images and 115 billion text tokens. OBELICS's long documents let the language model learn to handle arbitrarily interleaved images and text while maintaining its performance.

Image-Text Pairs

Next, image-text pairs are used to teach the model the correspondence between images and their associated text. The authors use high-quality human-annotated image-text pairs from PMD together with synthetic captions from LAION COCO, in which images are captioned by a model trained on COCO and are therefore less noisy. A high-recall NSFW classifier was also applied for filtering.

PDF Documents

To address VLMs' weakness at extracting text from images and documents, the authors trained Idefics2 on 19 million industry documents from OCR-IDL and 18 million pages from PDFA, and added Rendered Text to improve recognition of diverse fonts and colored text. As the table below shows, this setup significantly improved the model's ability to read documents and extract text from images.

[Table: effect of adding OCR and document data]

Training Process

To improve computational efficiency, pre-training is carried out in two stages. In the first stage, the maximum image resolution is limited to 384 pixels, which allows a batch size of 2048 (covering 17,000 images and 25 million text tokens). 70% of the samples come from OBELICS (maximum sequence length 2048) and 30% from the image-text pair datasets (maximum sequence length 1536).
In the second stage, PDF documents are introduced and the maximum resolution is raised to 980 pixels. The global batch size is kept the same, but the per-device batch size is reduced, with gradient accumulation compensating for the extra memory. Samples are drawn 45% from OBELICS (maximum sequence length 2048), 35% from image-text pairs (maximum sequence length 1536), and 20% from PDF documents (maximum sequence length 1024). Images are also randomly upscaled so that the model sees a range of sizes.
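A toy sketch of the stage-two sampling mixture described above; the proportions and sequence lengths come from the text, while the sampling logic itself is purely illustrative (the real data loader is considerably more involved).

```python
import random

# (dataset name, sampling probability, max sequence length), per the text above
SOURCES = [
    ("obelics",          0.45, 2048),
    ("image_text_pairs", 0.35, 1536),
    ("pdf_documents",    0.20, 1024),
]

def sample_source():
    """Pick the dataset the next training example is drawn from."""
    names, probs, max_lens = zip(*SOURCES)
    idx = random.choices(range(len(SOURCES)), weights=probs, k=1)[0]
    return names[idx], max_lens[idx]
```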

Model Evaluation

The base model is evaluated on VQAv2, TextVQA, OKVQA, and COCO, as shown in the table below:

[Table: base model evaluation results]

Although Idefics2 uses fewer tokens per image, its efficiency allows it to outperform the current best foundational vision-language models, with a particularly clear advantage in understanding text within images. The following example shows Idefics2-base reading handwritten text.

[Example: Idefics2-base recognizing handwritten text]

2. Instruction Fine-tuning

During the instruction fine-tuning phase, the authors created The Cauldron, a large collection of 50 vision-language datasets covering tasks such as visual question answering, counting, captioning, text transcription, and document understanding. The datasets are cast into a shared question/answer format, and multiple question/answer pairs are assembled into multi-turn dialogues. Text-only instruction data was also added to teach the model to follow complex instructions and to solve mathematical and arithmetic problems.
DoRA, a variant of LoRA, was used to fine-tune the base model. During fine-tuning, the loss was computed only on the answer part of each Q/A pair, and strategies such as NEFTune (adding noise to the embeddings) were used to reduce the risk of overfitting. Image resolutions were randomly adjusted, and the order of multi-turn interactions was randomly shuffled before examples were fed to the model.
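One concrete detail worth illustrating is the loss masking: the loss is computed only on the answer tokens of each Q/A pair. A minimal sketch (with an illustrative token boundary) follows; -100 is the standard ignore index of PyTorch's cross-entropy loss, which transformers models also use.

```python
import torch

def build_labels(input_ids: torch.Tensor, answer_start: int) -> torch.Tensor:
    """input_ids: (seq_len,) tokens of 'question + answer'; answer_start is the
    index of the first answer token (illustrative; real boundaries come from the
    chat template)."""
    labels = input_ids.clone()
    labels[:answer_start] = -100      # prompt/question tokens contribute no loss
    return labels

labels = build_labels(torch.arange(20), answer_start=12)   # first 12 positions are masked
```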
As shown in the evaluation table, Idefics2 performed excellently on benchmarks like MMMU, MathVista, TextVQA, and MMBench, not only having higher computational efficiency during inference but also outperforming similarly sized visual language models (LLaVA-Next, DeepSeek-VL, MM1-Chat).
[Table: instruction-tuned model benchmark comparison]
Idefics2 has performance comparable to advanced models four times its size, and in benchmarks like MathVista and TextVQA, it can even compete with the closed-source model Gemini 1.5 Pro.

3. Dialogue Scenario Optimization

Evaluation benchmarks usually expect very concise answers, whereas humans interacting with a model tend to prefer longer responses. After being trained to follow the expected benchmark formats precisely, Idefics2 can struggle to judge how long its answers should be.
Therefore, after instruction fine-tuning, the authors further trained Idefics2 on dialogue data, fine-tuning it for a few hundred steps on LLaVA-Conv and ShareGPT4V.
User feedback indicates that in many interactions, Idefics2-chatty clearly outperformed the version that was only instruction fine-tuned. Here are some generation examples:

▲ Describing an AI-generated image

▲ Answering questions based on scientific charts

Conclusion

This article explores, through detailed experiments, how effective the tricks commonly cited in the literature really are when building multimodal large models, and draws a series of valuable conclusions. The authors then applied these validated techniques to build Idefics2, a high-performance 8B-parameter vision-language model. Among models of the same scale, Idefics2 achieves state-of-the-art performance with higher inference efficiency, providing a useful reference for research on multimodal large models.

References

[1] Karamcheti, S., Nair, S., Balakrishna, A., Liang, P., Kollar, T., and Sadigh, D. (2024). Prismatic VLMs: Investigating the design space of visually-conditioned language models.
[2] Vallaeys, T., Shukor, M., Cord, M., and Verbeek, J. (2024). Improved baselines for data-efficient perceptual augmentation of LLMs.
[3] McKinzie, B., et al. (2024). MM1: Methods, analysis & insights from multimodal LLM pre-training.
