Machine Heart Column
Machine Heart Editorial Team
Researchers from Salesforce AI, Northeastern University, and Stanford University propose UniControl, which uses an MOE-style Adapter and a Task-aware HyperNet to achieve multimodal conditional generation. UniControl is trained on nine distinct condition-to-image (C2I) tasks and demonstrates strong visual generation quality as well as zero-shot generalization ability.
- Paper link: https://arxiv.org/abs/2305.11147
- Code link: https://github.com/salesforce/UniControl
- Project homepage: https://shorturl.at/lmMX6
Introduction: Stable Diffusion has shown powerful visual generation capabilities, but it often struggles to generate images under spatial, structural, or geometric control. Works such as ControlNet [1] and T2I-adapter [2] achieve controllable image generation for individual modalities, but supporting diverse visual conditions within a single unified model remains an unresolved challenge. UniControl integrates a variety of condition-to-image (C2I) tasks within one framework. To let UniControl handle diverse visual conditions, the authors introduce a Task-aware HyperNet that modulates the downstream conditional diffusion model so it can adapt to different C2I tasks simultaneously. UniControl is trained on nine different C2I tasks and demonstrates strong visual generation quality and zero-shot generalization ability. The authors have open-sourced the model parameters and inference code; the dataset and training code will be released as soon as possible, and they welcome discussion and use by the community.
Figure 1: The UniControl model covers multiple pre-trained tasks and zero-shot tasks

Motivation: Existing controllable image generation models are each designed for a single modality. However, works such as Taskonomy [3] show that different visual modalities share features and information, so this paper argues that a unified multimodal model has great potential.

Solution: This paper proposes the MOE-style Adapter and the Task-aware HyperNet to give UniControl multimodal conditional generation capability. In addition, the authors built a new dataset, MultiGen-20M, containing 9 major tasks and over 20 million image-condition-prompt triplets, with image sizes ≥ 512.

Advantages: 1) A more compact model (1.4B #params, 5.78GB checkpoint) that covers multiple tasks with fewer parameters; 2) stronger visual generation quality and control accuracy; 3) zero-shot generalization to unseen modalities.

1. Introduction

Generative foundation models are changing how AI operates in fields such as natural language processing, computer vision, audio processing, and robotic control. In natural language processing, generative foundation models such as InstructGPT and GPT-4 perform well across many tasks, and this multi-tasking capability is one of their most attractive features; they can also handle unseen tasks through zero-shot or few-shot learning.

In the visual domain, however, generative models do not yet exhibit this multi-tasking capability as prominently. Text descriptions offer a flexible way to control the content of generated images, but they often fall short of providing pixel-level spatial, structural, or geometric control. Recent popular works such as ControlNet and T2I-adapter augment the Stable Diffusion Model (SDM) for precise control, but unlike language prompts, which can be processed by a unified module such as CLIP, each ControlNet model can only handle the specific modality it was trained on.

To overcome the limitations of previous works, this paper proposes UniControl, a unified diffusion model that handles language and diverse visual conditions simultaneously. The unified design benefits from improved training and inference efficiency as well as stronger controllable generation, and UniControl exploits the inherent connections between different visual conditions to improve generation for every condition.

UniControl's unified controllable generation rests on two components: the MOE-style Adapter and the Task-aware HyperNet. The MOE-style Adapter has about 70K parameters per task and learns low-level feature maps from the various modalities, while the Task-aware HyperNet takes task instructions as natural-language prompts and outputs task embeddings that modulate the parameters of the downstream network so it adapts to different modality inputs.

UniControl was pre-trained to acquire multi-task and zero-shot learning capability on nine tasks across five categories: edges (Canny, HED, Sketch), region maps (Segmentation, Object Bounding Box), skeleton (Human Skeleton), geometry (Depth, Surface Normal), and image editing (Image Outpainting). The model was trained on NVIDIA A100 hardware for over 5,000 GPU hours (a newer model is still being trained), and it demonstrates zero-shot adaptability to new tasks.

The contributions of this study can be summarized as follows:
- This study proposed UniControl, a unified model capable of handling various visual conditions (1.4B #params, 5.78GB checkpoint) for controllable visual generation.
- This study collected a new multi-condition visual generation dataset containing over 20 million image-text-condition triplets, covering nine different tasks across five categories.
- This study conducted experiments demonstrating that the unified UniControl model, by learning the intrinsic relationships between different visual conditions, outperforms single-task controllable image generation on each individual task.
- UniControl exhibited the ability to adapt to unseen tasks in a zero-shot manner, showcasing its potential for widespread use in open environments.
2. Model Design
Figure 2: Model structure. To accommodate multiple tasks, this study designs an MOE-style Adapter (approximately 70K parameters per task) and a Task-aware HyperNet (approximately 12M parameters) that modulates seven zero-convolution layers. This structure provides multi-task capability within a single model, preserving task diversity while sharing the underlying parameters. Compared with stacking the equivalent single-task models (each about 1.4B parameters), it significantly reduces the model size.

The UniControl model design ensures two properties:

1) Overcoming the misalignment of low-level features across modalities. This helps UniControl learn the necessary and unique information from every task. For example, when the model takes a segmentation map as the visual condition, it may need to ignore 3D information.

2) The ability to learn meta-knowledge across tasks. This lets the model understand both the knowledge shared between tasks and their differences.

To provide these properties, the model introduces two novel modules: the MOE-style Adapter and the Task-aware HyperNet.

The MOE-style Adapter is a set of convolution modules, with one adapter per modality. Inspired by the mixture-of-experts (MOE) model, it allows UniControl to capture the features of each low-level visual condition. Each adapter module has about 70K parameters and is highly computationally efficient; the resulting visual features are then processed by a single shared network.

The Task-aware HyperNet adjusts the zero-convolution modules of ControlNet conditioned on the task instruction. The HyperNet first projects the task instruction into a task embedding, which is then injected into ControlNet's zero-convolution layers. The dimensionality of the task embedding matches that of the zero-convolution kernels, and similar to StyleGAN [4], the two are multiplied to modulate the convolution parameters, with the modulated parameters serving as the final convolution weights. The modulated zero-convolution parameters therefore differ for each task, giving the model adaptability to every modality while all weights remain shared. A sketch of the two modules is shown below.
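To make the two modules concrete, here is a minimal PyTorch sketch of a per-task adapter bank and a task-embedding-modulated zero convolution. All module names, layer sizes, and the `to_scale` projection are illustrative assumptions rather than the authors' released implementation; the sketch only mirrors the per-modality routing and StyleGAN-style weight modulation described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MOEStyleAdapter(nn.Module):
    """One small convolutional adapter per task (illustrative; real sizes differ).

    Each adapter maps a raw visual condition (e.g. a Canny map or a depth map)
    into a shared feature space consumed by the unified downstream branch.
    """
    def __init__(self, task_names, in_ch=3, out_ch=64):
        super().__init__()
        self.adapters = nn.ModuleDict({
            name: nn.Sequential(
                nn.Conv2d(in_ch, 16, 3, padding=1), nn.SiLU(),
                nn.Conv2d(16, out_ch, 3, padding=1),
            )
            for name in task_names
        })

    def forward(self, cond_image, task_name):
        # Route the condition only through the adapter of its own modality.
        return self.adapters[task_name](cond_image)


class TaskHyperNet(nn.Module):
    """Projects a task-instruction text embedding into a compact task embedding."""
    def __init__(self, text_dim=768, task_dim=128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(text_dim, task_dim), nn.SiLU(),
            nn.Linear(task_dim, task_dim),
        )

    def forward(self, instruction_emb):
        return self.proj(instruction_emb)


class TaskModulatedZeroConv(nn.Module):
    """Zero-initialized 1x1 convolution whose kernel is rescaled by the task
    embedding, loosely in the spirit of StyleGAN weight modulation."""
    def __init__(self, channels, task_dim=128):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)
        self.to_scale = nn.Linear(task_dim, channels)

    def forward(self, x, task_emb):
        # task_emb: (task_dim,) -- one task per forward pass for simplicity.
        scale = self.to_scale(task_emb).view(-1, 1, 1, 1)   # (C_out, 1, 1, 1)
        w = self.conv.weight * scale                        # modulated kernel
        return F.conv2d(x, w, self.conv.bias)
```

Because all tasks share the zero-convolution weights and differ only in the modulation, the per-task cost stays at the adapter's ~70K parameters instead of a full extra model.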
3. Model Training

Unlike SDM or ControlNet, which condition image generation on a single language prompt or a single type of visual condition, UniControl must handle a variety of visual conditions from different tasks in addition to the language prompt. Its input therefore consists of four parts: noise, text prompt, visual condition, and task instruction, where the task instruction is derived naturally from the modality of the visual condition.

With such generated training pairs, this study employed DDPM [5] to train the model.
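For reference, the DDPM objective amounts to predicting the noise added to the clean latent at a randomly sampled timestep. The sketch below assumes a denoiser that additionally receives the adapter features and the task embedding; that signature, like the variable names, is an assumption for illustration rather than the released training code.

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(model, x0, text_emb, cond_feat, task_emb, alphas_cumprod):
    """One DDPM training step: predict the noise added to x0 at a random timestep.

    `model` is assumed to take (noisy latent, timestep, text embedding,
    visual-condition features, task embedding) and return the predicted noise.
    """
    b = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x0.device)
    noise = torch.randn_like(x0)

    # Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

    # Epsilon-prediction loss, as in DDPM / Stable Diffusion training.
    pred_noise = model(x_t, t, text_emb, cond_feat, task_emb)
    return F.mse_loss(pred_noise, noise)
```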
4. Experimental Results

Figure 6: Visual comparison results on the test set; test data come from MSCOCO [6] and Laion [7]

Figure 6 compares UniControl with the official or reproduced results of ControlNet; more results can be found in the paper.

5. Zero-shot Task Generalization

The model's zero-shot capability was tested in two scenarios:

Mixed-task generalization: This study feeds UniControl two different visual conditions at once, a mix of a segmentation map and a human skeleton, and adds the keywords "background" and "foreground" to the text prompt. The mixed task instruction is rewritten as a combination of the two tasks' instructions, such as "segmentation map and human skeleton to image".

New-task generalization: UniControl must generate controllable images from new, unseen visual conditions. The key is to estimate task weights based on the relationship between the unseen task and the seen pre-trained tasks. The weights can be estimated either manually or by computing the similarity of the task instructions in the embedding space. The MOE-style Adapters can then be linearly combined with the estimated task weights to extract shallow features from the unseen visual condition, as sketched below.
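A minimal sketch of how such an adapter mix could be computed, reusing the hypothetical `MOEStyleAdapter` from the earlier sketch. The cosine-similarity weighting and the softmax temperature are illustrative assumptions; as noted above, the weights may also be set manually.

```python
import torch
import torch.nn.functional as F

def estimate_task_weights(new_instruction_emb, seen_instruction_embs, temperature=0.1):
    """Weight each pre-trained task by how similar its instruction embedding is
    to the unseen task's instruction embedding (illustrative sketch)."""
    sims = F.cosine_similarity(new_instruction_emb.unsqueeze(0),
                               seen_instruction_embs, dim=-1)   # (num_seen_tasks,)
    return F.softmax(sims / temperature, dim=0)

def zero_shot_condition_features(cond_image, adapter_bank, task_names, weights):
    """Linearly combine the outputs of the pre-trained MOE-style adapters
    using the estimated task weights."""
    feats = torch.stack([adapter_bank(cond_image, name) for name in task_names])
    return (weights.view(-1, 1, 1, 1, 1) * feats).sum(dim=0)   # (B, C, H, W)
```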
Visual results are shown in Figure 7; more results can be found in the paper.

Figure 7: Visual results of UniControl on zero-shot tasks

6. Conclusion

Overall, the UniControl model provides a new foundation model for controllable visual generation through the diversity of its controls. It opens the possibility of achieving a higher level of both autonomy and human control in image generation tasks. The study looks forward to discussing and collaborating with more researchers to further advance this field.

More Visual Results
[1] Zhang, Lvmin, and Maneesh Agrawala. "Adding conditional control to text-to-image diffusion models." arXiv preprint arXiv:2302.05543 (2023).
[2] Mou, Chong, et al. "T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models." arXiv preprint arXiv:2302.08453 (2023).
[3] Zamir, Amir R., et al. "Taskonomy: Disentangling task transfer learning." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[4] Karras, Tero, Samuli Laine, and Timo Aila. "A style-based generator architecture for generative adversarial networks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
[5] Ho, Jonathan, Ajay Jain, and Pieter Abbeel. "Denoising diffusion probabilistic models." Advances in Neural Information Processing Systems 33 (2020): 6840-6851.
[6] Lin, Tsung-Yi, et al. "Microsoft COCO: Common objects in context." Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V. Springer International Publishing, 2014.
[7] Schuhmann, Christoph, et al. "LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs." arXiv preprint arXiv:2111.02114 (2021).