
Xi Xiaoyao Technology Says | Original
Author | Xie Nian Nian
When building multimodal large models, there are many tricks that are considered effective, such as using cross-attention to inject image information into the language model, or directly concatenating the image hidden-state sequence with the text embedding sequence as input to the language model.
However, why these tricks work and what they cost computationally is often explained only roughly, or without sufficient experimental validation.
The Hugging Face team recently conducted extensive experiments to verify which tricks are truly effective when building multimodal large models, resulting in a series of highly valuable conclusions, even overturning commonly used views in previous literature.
Based on these validated tricks, the team open-sourced an 8B-parameter vision-language model, Idefics2, which is the strongest model of its size; on some benchmarks its performance even surpasses models four times larger, and it is competitive with the closed-source Gemini 1.5 Pro.

In addition, Idefics2 has undergone specialized conversational training, performing quite well during user interactions.
For example, analyzing data in a table and performing the correct calculations:
Finding the necessary information in a resume and organizing it in JSON format:

Its interpretation of memes is also quite decent:

This meme depicts a young girl wearing a yellow raincoat, seemingly walking through a grassy area. She is holding a yellow object, possibly a toy or a device. The background of the photo features a green field with trees in the distance. The text on the meme reads, “I got off work the day before vacation.” This suggests that the girl is excited to leave work early before the vacation starts, symbolizing her joy as she runs happily in the field. The girl’s energetic posture combined with the concept of “work” creates a relaxed and relatable scene for viewers who may also be looking forward to vacation.
The team has also released the source code and a demo, so anyone interested can try it out~
Trial Address: https://huggingface.co/spaces/HuggingFaceM4/idefics2_playground
Paper Title: What matters when building vision-language models?
Paper Link: https://arxiv.org/pdf/2405.02246
Basic Structure of Multimodal Large Models
First, let’s briefly understand the components of multimodal large models.
Generally speaking, a multimodal large model can be divided into a multimodal understanding part and a multimodal generation part, as shown in the figure below. Multimodal understanding consists of three components: the multimodal encoder, the input projection, and the large-model backbone, while multimodal generation consists of the output projection and the multimodal generator. During training, the parameters of the multimodal encoder, the generator, and the large-model backbone are typically frozen, and optimization focuses mainly on the input and output projections.

This article mainly focuses on the multimodal understanding capability, thus paying special attention to the multimodal encoder and input projection parts.
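To make this division concrete, here is a minimal PyTorch sketch of the typical freeze/train split; the module names and dimensions are illustrative assumptions, not the authors' actual code.

```python
import torch.nn as nn

class ToyMultimodalModel(nn.Module):
    """Pre-trained vision encoder and LLM stay frozen; the projection is trained."""
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder                     # frozen
        self.llm = llm                                           # frozen
        self.input_projection = nn.Linear(vision_dim, llm_dim)   # trainable
        for module in (self.vision_encoder, self.llm):
            for p in module.parameters():
                p.requires_grad = False
```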
Are Common Tricks in Building Multimodal Large Models Really Effective?
The Impact of Modality Encoders on Performance
Multimodal large models use pre-trained modality encoders to extract features from visual inputs and use the language model backbone to extract features from text inputs. So how does choosing different visual and text models affect the final performance?
The authors fixed the size of the pre-trained modules, the multimodal pre-training data, and the number of training updates. Under the cross-attention architecture, swapping in a stronger pre-trained backbone significantly improved performance on vision-language benchmarks.
As shown in Table 1, replacing the language model LLaMA-1-7B with Mistral-7B improved performance by 5.1 percentage points.
Moreover, switching the visual encoder from CLIP-ViT-H to SigLIP-SO400M improved benchmark performance by 3.3 percentage points, as shown in Table 2:

Conclusion: For a fixed number of parameters, the quality of the language-model backbone has a greater impact on final VLM performance than the quality of the vision backbone.
Which is Better: Fully Autoregressive Architecture or Cross-Attention Architecture?
The purpose of input projection is to connect the pre-trained visual module and language module, aligning visual and text inputs. There are two mainstream methods:
- Cross-attention architecture: the image is encoded by the vision module, and the resulting image embeddings are fused with the text through cross-attention blocks inserted at different layers of the language model.
- Fully autoregressive architecture: the output of the vision encoder is concatenated directly with the text embeddings, and the whole sequence is fed into the language model; the visual sequence can be compressed (pooled) to improve computational efficiency (a minimal sketch follows this list).
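Here is that sketch of the fully autoregressive input construction; the function and tensor names are illustrative assumptions rather than the paper's code.

```python
import torch

def build_fully_autoregressive_inputs(image_hidden, text_embeds, projector):
    """Concatenate projected image hidden states with text embeddings.

    image_hidden: (batch, n_img_tokens, vision_dim) from the vision encoder
    text_embeds:  (batch, n_text_tokens, llm_dim) from the LLM embedding table
    projector:    a trainable module mapping vision_dim -> llm_dim
    """
    img_tokens = projector(image_hidden)               # (batch, n_img_tokens, llm_dim)
    # The whole sequence is then fed to the language model as ordinary tokens.
    return torch.cat([img_tokens, text_embeds], dim=1)
```

The cross-attention alternative keeps the two sequences separate: the text hidden states act as queries and the image embeddings as keys/values inside cross-attention blocks interleaved with the language-model layers.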
To compare the two architectures, the authors froze the unimodal modules and trained only the newly initialized parameters (the cross-attention blocks in one case, the modality projection and pooling layers in the other) on a fixed amount of training data. They also found that interleaving cross-attention blocks more frequently with the language-model layers improves vision-language performance. Under this setup, the cross-attention architecture has 1.3 billion more trainable parameters (2 billion trainable in total) and about 10% more inference compute, and it outperforms the fully autoregressive architecture by 7 percentage points, as shown in the second and third rows of the table.

Trainable parameters make up only about 15% of all parameters in the fully autoregressive architecture and about 25% in the cross-attention architecture, and such a low proportion may limit expressiveness during training. The authors therefore unfroze all parameters (the newly initialized ones as well as the pre-trained unimodal modules) and compared the two architectures again. To keep the training loss of the fully autoregressive architecture from diverging, LoRA was used to adapt the pre-trained parameters while the newly initialized parameters were fully fine-tuned; the results are shown in the last two rows of the table.
This setup significantly improves training stability: the fully autoregressive architecture gains 12.9 percentage points, while the cross-attention architecture gains only 0.6 points. With more trainable parameters, the fully autoregressive architecture is therefore the more cost-effective choice.
Conclusion 1: When the unimodal pre-training modules are frozen, the performance of the cross-attention structure is better than that of the fully autoregressive structure. However, once the unimodal networks are unfrozen and trained, despite the cross-attention structure having more parameters, the fully autoregressive architecture exhibits better performance.
Conclusion 2: Under the fully autoregressive architecture, directly unfreezing the pre-trained modules may lead to instability in the training process. Using LoRA technology can effectively increase the model’s expressiveness while maintaining training stability.
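As a rough illustration of how LoRA keeps the pre-trained weights frozen while still adapting them, here is a minimal sketch of a LoRA-wrapped linear layer; the rank and scaling values are assumptions, and this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():               # pre-trained weights stay frozen
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)
```

Newly initialized modules (the modality projection and pooler) would be fully fine-tuned, while pre-trained linear layers are wrapped in this fashion.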
More Image Tokens, Stronger Performance?
Previous studies typically pass all hidden states of the vision encoder through the modality projection into the language model without any pooling, so each image yields a large number of tokens and training costs increase. Studies [2,3] report that more visual tokens bring better performance, but the authors found that beyond 64 visual tokens performance stopped improving. They speculate that with unlimited training compute and data, more tokens might still help, but at a cost that is unacceptable in practice.
To address this, the authors introduced a trainable Transformer pooler (such as a Perceiver resampler) to reduce the sequence length of each image's hidden states. This not only reduces the number of tokens but also improves model performance: compared to no pooling, it improves results by 8.5 points on average while cutting the number of tokens per image from 729 to 64, as shown in the table below.
Conclusion: Using a trainable pooler to reduce the number of visual tokens significantly improves computational efficiency during training and inference while enhancing performance in downstream tasks.
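A Perceiver-style pooler can be sketched as a fixed set of learned latent queries that cross-attend to the image hidden states; the dimensions and head count below are illustrative assumptions, not the exact Idefics2 configuration.

```python
import torch
import torch.nn as nn

class PerceiverPooler(nn.Module):
    """Reduce a variable-length image sequence to a fixed set of 64 latent tokens."""
    def __init__(self, dim: int = 1152, n_latents: int = 64, n_heads: int = 16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim) * 0.02)  # learned queries
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_hidden):                    # (batch, 729, dim)
        batch = image_hidden.size(0)
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        pooled, _ = self.cross_attn(queries, image_hidden, image_hidden)
        return self.norm(pooled)                        # (batch, 64, dim)
```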
Does Fixing Image Aspect Ratio and Resolution Affect Performance?
Visual encoders (such as SigLIP) are typically trained on fixed-size square images. Adjusting image sizes alters their original aspect ratios, which can be problematic for certain tasks (such as reading long text). Additionally, training on a single resolution has limitations: low resolution may overlook critical visual details, while high resolution reduces training and inference efficiency. Allowing the model to handle images of different resolutions can enable users to flexibly adjust computational resources.
Instead, the authors pass the image patches to the vision encoder without resizing the image or changing its aspect ratio. Since the encoder was pre-trained on fixed-size, low-resolution square images, the pre-trained positional embeddings are interpolated to match the new resolution, and LoRA parameters are used to adapt the vision encoder. The results are shown in the table below:
It can be seen that the strategy of maintaining the original aspect ratio (AR preserving) accelerates training and inference while reducing memory consumption without compromising performance.
Conclusion: Using a pre-trained visual encoder on fixed-size square images to maintain the original aspect ratio and resolution accelerates training and inference, reduces memory consumption, and does not affect performance.
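One way to adapt a vision encoder trained on fixed-size squares to arbitrary resolutions is to interpolate its pre-trained positional embeddings to the new patch grid; the sketch below assumes a learned 2D grid of positional embeddings and is illustrative rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_h: int, new_w: int) -> torch.Tensor:
    """Resize a square grid of positional embeddings to a new (h, w) patch grid.

    pos_embed: (1, grid*grid, dim), learned on square images, e.g. a 27x27 grid.
    """
    n, dim = pos_embed.shape[1], pos_embed.shape[2]
    grid = int(n ** 0.5)
    pe = pos_embed.reshape(1, grid, grid, dim).permute(0, 3, 1, 2)   # (1, dim, g, g)
    pe = F.interpolate(pe, size=(new_h, new_w), mode="bicubic", align_corners=False)
    return pe.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)
```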
What is the Impact of Training on Sub-images?
Several papers suggest that splitting an image into sub-images and concatenating them with the original image improves downstream performance, at the cost of a much larger number of image tokens to encode.
During the instruction fine-tuning phase, the authors expanded each image into a list containing the original image and four cropped images. Thus, the model can handle both a single image (64 visual tokens) and an enhanced image set (a total of 320 visual tokens) during inference; the results are shown in the table below:
This strategy is particularly effective for benchmarks like TextVQA and DocVQA, as they require high resolution to extract text from images. Even segmenting only 50% of the training images did not affect performance.
Conclusion: Splitting images into sub-images during training can enhance computational efficiency during inference and improve performance. The performance enhancement is particularly notable in tasks involving reading text from images.
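The splitting strategy itself is simple; a minimal sketch follows, where taking the four quadrant crops is an assumption about how the sub-images are produced.

```python
from PIL import Image

def split_into_sub_images(image: Image.Image):
    """Return the original image plus its four quadrant crops (5 images total)."""
    w, h = image.size
    crops = [
        image.crop((0, 0, w // 2, h // 2)),        # top-left
        image.crop((w // 2, 0, w, h // 2)),        # top-right
        image.crop((0, h // 2, w // 2, h)),        # bottom-left
        image.crop((w // 2, h // 2, w, h)),        # bottom-right
    ]
    return [image] + crops

# With 64 visual tokens per image, a single image costs 64 tokens and the
# expanded set costs 5 * 64 = 320 tokens, matching the numbers above.
```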
Building Idefics2—An Open Advanced Visual Language Foundation Model
After discussing the factors affecting visual model performance, the authors trained an open 8B parameter visual language model—Idefics2. The following will elaborate on the model’s construction, dataset selection, and training phase process.
1. Multi-stage Pre-training
The authors started from SigLIP-SO400M and Mistral-7B-v0.1 and pre-trained Idefics2 on three types of data.
Interleaved Image-Text Documents
The data source is the OBELICS dataset, which has been filtered and cleaned. This is an open dataset of interleaved image-text web documents containing 350 million images and 115 billion text tokens. OBELICS's long documents let the language model learn to handle an arbitrary number of interleaved images and text passages while maintaining performance.
Image-Text Pairs
Next, the model is trained on image-text pairs to learn the correspondence between images and their associated text. The authors use high-quality, human-annotated image-text pairs from PMD, together with synthetically captioned data from LAION COCO, in which images are captioned by a model trained on COCO, resulting in less noisy captions. A high-recall NSFW classifier was also used for filtering.
PDF Documents
To address the weakness of VLMs in extracting text from images and documents, the authors trained Idefics2 on 19 million industry documents from OCR-IDL and 18 million pages from PDFA, and added Rendered Text to improve recognition of diverse fonts and richly colored text. As the table shows, this setup significantly improves the model's ability to read documents and extract text from images.
Training Process
To improve computational efficiency, pre-training is conducted in two stages. In the first stage, the maximum image resolution is set to 384 pixels, allowing for an average batch size of 2048 (covering 17,000 images and 25 million text tokens). 70% of the data is based on the OBELICS dataset (maximum sequence length of 2048), and 30% is from the image-text pair dataset (maximum sequence length of 1536).
In the second stage, PDF documents are introduced and the resolution is increased to 980 pixels. The overall batch size is kept the same, but the per-device batch size is reduced and gradient accumulation is used to compensate for the additional memory cost. In terms of sampling, OBELICS accounts for 45% (maximum sequence length 2048), image-text pairs for 35% (maximum sequence length 1536), and PDF documents for 20% (maximum sequence length 1024). Images are also randomly upscaled so the model sees a range of sizes.
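Gradient accumulation here simply spreads one effective batch over several smaller forward/backward passes; a minimal sketch follows, assuming a Hugging Face-style model that returns an output with a .loss attribute.

```python
def train_with_grad_accumulation(model, optimizer, dataloader, accumulation_steps=8):
    """Keep the effective batch size constant while the per-device batch shrinks."""
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        loss = model(**batch).loss / accumulation_steps   # scale so gradients average correctly
        loss.backward()
        if (step + 1) % accumulation_steps == 0:          # one optimizer step per effective batch
            optimizer.step()
            optimizer.zero_grad()
```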
Model Evaluation
This article selects VQAv2, TextVQA, OKVQA, and COCO for model evaluation. As shown in the table:
Although Idefics2 uses fewer tokens per image, this efficiency does not come at the expense of quality: it surpasses the current best base vision-language models, and it shows a particularly clear advantage in understanding text within images. The image below demonstrates Idefics2-base's ability to recognize handwritten text.
2. Instruction Fine-tuning
During the instruction fine-tuning phase, a massive collection called The Cauldron was created, mixing 50 visual-language datasets covering a wide range of tasks such as visual question answering, counting, captioning, text transcription, document understanding, etc. The dataset employs a shared question/answer format, constructing multi-turn dialogues for multiple question/answer pairs. Additionally, a pure text instruction dataset was added to teach the model to follow complex instructions and solve math and arithmetic problems.
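As a purely hypothetical illustration of a shared multi-turn question/answer record (the field names are assumptions, not the real Cauldron schema), such a sample might look like:

```python
# Hypothetical example; field names are illustrative, not the actual Cauldron format.
example = {
    "images": ["<image_0>"],
    "qa_pairs": [
        {"question": "How many people are at the table?", "answer": "Three."},
        {"question": "What are they doing?", "answer": "They are playing chess."},
    ],
}
```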
A LoRA variant called DoRA was used to tune the base model during instruction fine-tuning. The loss was computed only on the answer part of each Q/A pair, and strategies such as NEFTune, which adds noise to the embeddings, were used to reduce the risk of overfitting. Image resolutions were randomly scaled, and multi-turn interactions were randomly shuffled before being fed to the model.
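NEFTune adds uniform noise to the input embeddings during training only; the sketch below follows the NEFTune paper's alpha / sqrt(seq_len * hidden_dim) scaling rule, with alpha as an assumed hyperparameter.

```python
import torch

def neftune_noise(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add uniform noise bounded by alpha / sqrt(L * d) to (batch, L, d) embeddings."""
    seq_len, hidden_dim = embeddings.shape[1], embeddings.shape[2]
    scale = alpha / (seq_len * hidden_dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-scale, scale)
    return embeddings + noise
```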
The evaluation results shown in the table indicate that Idefics2 performs excellently on benchmarks like MMMU, MathVista, TextVQA, and MMBench, not only having higher computational efficiency during inference but also surpassing similarly sized visual language models (LLaVA-Next, DeepSeek-VL, MM1-Chat).

Idefics2’s performance is comparable to that of state-of-the-art models that are four times its size, and it can even compete with the closed-source model Gemini 1.5 Pro on benchmarks like MathVista and TextVQA.
3. Dialogue Scene Optimization
Evaluation benchmarks usually expect very brief answers, whereas humans tend to prefer longer responses when interacting with a model, so after instruction fine-tuning Idefics2 can struggle to follow the response format and length that users actually expect.
Therefore, after instruction fine-tuning, the authors further trained Idefics2 on dialogue data. Idefics2 underwent several hundred steps of fine-tuning on LLaVA-Conv and ShareGPT4V.
User evaluations indicate that in many interactions, Idefics2-chatty significantly outperforms the version that only underwent instruction fine-tuning. Here are some generation examples:


Conclusion
Through exhaustive experiments, this article examines in detail how effective the tricks commonly used in the literature really are when building multimodal large models, arriving at a series of valuable conclusions. The authors then applied these validated techniques to build a high-performance 8B-parameter vision-language model, Idefics2. Among models of similar scale, Idefics2 achieves state-of-the-art performance with higher inference efficiency, providing a useful reference for multimodal large-model research.
References
[1] Karamcheti, S., Nair, S., Balakrishna, A., Liang, P., Kollar, T., and Sadigh, D. (2024). Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models.
[2] Vallaeys, T., Shukor, M., Cord, M., and Verbeek, J. (2024). Improved Baselines for Data-Efficient Perceptual Augmentation of LLMs.
[3] McKinzie, B., et al. (2024). MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training.