Source: I Love Computer Vision

This article introduces CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation, a work from Fudan University and ByteDance. It proposes a new paradigm for layout-to-image generation that brings layout-controllable image generation to the MM-DiT architecture.

CreatiLayout: A New SOTA for Layout-to-Image Generation

Paper link: https://arxiv.org/abs/2412.03859
Project homepage: https://creatilayout.github.io
Project code: https://github.com/HuiZhang0812/CreatiLayout
Project Demo: https://huggingface.co/spaces/HuiZhang0812/CreatiLayout
Dataset: https://huggingface.co/datasets/HuiZhang0812/LayoutSAM

Task Background

Layout-to-Image (L2I) generation is a technology for controllable image generation based on layout information, where the layout information includes the spatial positions and descriptions of entities in the image. For example, a user specifies the descriptions and spatial positions of these entities: Iron Man holding a drawing board, standing on a rock, with the text “CreatiLayout” written in hand-drawn font on the board, and the background being a seaside sunset. Layout-to-Image can generate an image that meets the user’s requirements based on this information.
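To make the input concrete, here is a minimal sketch of how such a layout could be represented in Python. The class names, field names, normalized (x0, y0, x1, y1) box convention, and the box coordinates are all assumptions chosen for illustration; they are not taken from the CreatiLayout code or the LayoutSAM schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box convention: (x0, y0, x1, y1), normalized to [0, 1],
# with the origin at the top-left corner of the image.
BBox = Tuple[float, float, float, float]

@dataclass
class Entity:
    description: str  # free-form text describing the entity
    bbox: BBox        # where the entity should appear

@dataclass
class Layout:
    global_caption: str     # description of the whole image
    entities: List[Entity]  # per-entity descriptions and positions

def is_valid_bbox(bbox: BBox) -> bool:
    """Check that a box lies inside the image and has positive area."""
    x0, y0, x1, y1 = bbox
    return 0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0

# The example from the text; the coordinates are made up for illustration.
layout = Layout(
    global_caption="Iron Man holding a drawing board, standing on a rock, "
                   "with a seaside sunset in the background",
    entities=[
        Entity("Iron Man", (0.30, 0.15, 0.70, 0.85)),
        Entity('a drawing board with the text "CreatiLayout" in hand-drawn font',
               (0.40, 0.40, 0.65, 0.60)),
        Entity("a rock", (0.20, 0.80, 0.80, 1.00)),
    ],
)
```

A Layout-to-Image model takes the global caption together with the per-entity descriptions and boxes as its conditioning input.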

Layout-to-Image further unlocks the capabilities of Text-to-Image models, giving users precise control and a channel for creative expression, with broad application prospects in game development, animation production, interior design, and creative design.

Previous Layout-to-Image models mainly faced the following issues:

Layout Data Issues: Existing layout datasets are small, closed-set, and only coarsely annotated at the entity level, which limits the model’s ability to generalize to open-set entities and to render entities with complex attributes precisely.
Model Architecture Issues: Previous models mainly focused on the U-Net architecture, such as SD1.5 and SDXL. However, with the development of MM-DiT, models like SD3 and FLUX have achieved new heights in visual quality and text adherence. Directly applying the U-Net’s layout control paradigm to MM-DiT would weaken the accuracy of layout control. Therefore, a new framework needs to be designed for MM-DiT to efficiently integrate layout information and fully leverage its potential.
User Experience Issues: Many existing methods only accept bounding boxes as the way for users to specify entity locations and cannot handle more flexible inputs (such as center points, masks, sketches, or just language descriptions), which limits the user experience. Additionally, these methods do not support refining a user’s layout by adding, deleting, or modifying entities.

Method Overview

To address the issues in data, models, and experience, CreatiLayout proposes targeted solutions to achieve higher quality and more controllable layout-to-image generation.

1. Large-Scale & Fine-Grained Layout Dataset: LayoutSAM

CreatiLayout built an automatic annotation pipeline for layouts and used it to create a large-scale layout dataset, LayoutSAM, which includes 2.7 million image-text pairs and 10.7 million entity annotations.

LayoutSAM is filtered from the SAM dataset, featuring open-set entities, fine-grained annotations, and high image quality. Each entity includes bounding boxes and detailed descriptions, covering complex attributes such as color, shape, and texture. This provides a data-driven foundation for the model to better understand and learn layout information.

Based on this, CreatiLayout established the layout-to-image generation evaluation benchmark LayoutSAM-Eval, comprehensively assessing model performance in layout control, image quality, and text adherence.

2. Model Architecture Viewing Layout Information as a Modality: SiamLayout

CreatiLayout proposed the SiamLayout framework, introducing layout information into MM-DiT while effectively alleviating the modality competition problem and enhancing the guiding role of layouts, achieving more precise layout control compared to other network solutions. The core design points are:

Treating layout information as an independent modality, as important as the text and image modalities, which strengthens the guidance that layout information exerts on image content.
Realizing the interaction between the layout modality and the image modality through MM-DiT’s native MM-Attention, preserving its advantages in modality interaction.
Decoupling the interaction of the three modalities (image, text, layout) into two siamese branches, an image-text interaction branch and an image-layout interaction branch, so that text and layout each guide image content independently without interfering with each other.
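The decoupling described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors’ implementation: it uses a single attention head, omits the learned query/key/value projections, and assumes that the image outputs of the two branches are fused by simple summation. The names `joint_attention` and `siamese_block` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(a, b):
    """Self-attention over the concatenated tokens of two modalities.

    Simplification: single head, identity Q/K/V projections.
    Returns the updated tokens of each modality separately.
    """
    tokens = np.concatenate([a, b], axis=0)           # (Na + Nb, d)
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    out = softmax(scores) @ tokens
    return out[: a.shape[0]], out[a.shape[0]:]

def siamese_block(image, text, layout):
    """Two decoupled branches: image-text and image-layout."""
    img_from_text, text_out = joint_attention(image, text)
    img_from_layout, layout_out = joint_attention(image, layout)
    # Assumed fusion: sum the image tokens produced by the two branches.
    image_out = img_from_text + img_from_layout
    return image_out, text_out, layout_out

rng = np.random.default_rng(0)
image = rng.normal(size=(16, 8))   # 16 image tokens, dimension 8
text = rng.normal(size=(5, 8))     # 5 text tokens
layout = rng.normal(size=(3, 8))   # 3 layout (entity) tokens
image_out, text_out, layout_out = siamese_block(image, text, layout)
```

In this sketch text tokens and layout tokens never attend to each other, so changing the layout leaves the text branch output unchanged. That is the sense in which the two modalities do not compete inside a single attention operation.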

3. Layout Designer Supporting Layout Generation and Optimization: LayoutDesigner

CreatiLayout proposed LayoutDesigner, which uses large language models for layout planning, capable of generating and optimizing layouts based on user input (center points, masks, sketches, text descriptions), supporting more flexible user input methods, and providing layout optimization functions such as adding, deleting, and modifying entities. This allows users to express their design intentions more conveniently and generate more harmonious and aesthetically pleasing layouts.

Experimental Results

1. Comparison Experiments with SOTA Methods in Layout-to-Image Generation

In the fine-grained open-set layout-to-image generation task, CreatiLayout outperforms previous SOTA methods in spatial positioning, color, texture, shape, and other regional attribute rendering; in overall image quality, CreatiLayout also exhibits better visual quality and text adherence.

The qualitative results further confirm the advantages of CreatiLayout. For example, it renders the text “HELLO FRIENDS” and pencils and benches of different colors more precisely. The project’s demo allows further exploration of CreatiLayout’s capabilities in Layout-to-Image.

2. Comparison Experiments with SOTA Methods in Layout Generation and Optimization

The quantitative and qualitative experiments on layout planning tasks compare the layout generation and optimization capabilities of different layout planners under user inputs of varying granularity. LayoutDesigner performs excellently in layout planning from global captions, center points, and bounding boxes, achieving a format accuracy of 100%, which indicates that its generated layouts consistently meet the format requirements.
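As an illustration of what a format-accuracy metric could measure, the sketch below checks whether each planner output parses as a list of entities with a description and a valid box, then reports the fraction that pass. The JSON schema, the field names `description` and `bbox`, and the validity rules are assumptions made for this example; the exact criterion used in the benchmark is not stated here.

```python
import json

def is_well_formed(raw: str) -> bool:
    """Return True if `raw` is a JSON list of entities with valid boxes.

    Assumed schema: [{"description": str, "bbox": [x0, y0, x1, y1]}, ...]
    with coordinates normalized to [0, 1].
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, list) or not data:
        return False
    for ent in data:
        if not isinstance(ent, dict):
            return False
        desc, box = ent.get("description"), ent.get("bbox")
        if not isinstance(desc, str) or not desc.strip():
            return False
        if not (isinstance(box, list) and len(box) == 4):
            return False
        if not all(isinstance(v, (int, float)) and not isinstance(v, bool)
                   for v in box):
            return False
        x0, y0, x1, y1 = box
        if not (0 <= x0 < x1 <= 1 and 0 <= y0 < y1 <= 1):
            return False
    return True

def format_accuracy(outputs) -> float:
    """Fraction of planner outputs that satisfy the assumed format."""
    if not outputs:
        return 0.0
    return sum(is_well_formed(o) for o in outputs) / len(outputs)

# Made-up planner outputs: one valid, one with x0 > x1, one that is not JSON.
samples = [
    '[{"description": "a red bench", "bbox": [0.1, 0.5, 0.6, 0.9]}]',
    '[{"description": "a pencil", "bbox": [0.7, 0.2, 0.6, 0.9]}]',
    'not json',
]
accuracy = format_accuracy(samples)
```

Under this check, a format accuracy of 100% would mean that every output of the planner can be passed directly to the image generator without repair.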

Furthermore, images generated from layouts planned by LayoutDesigner are of higher quality and more aesthetically pleasing. By comparison, layouts generated by Llama3.1 often lack key elements, while layouts generated by GPT4 frequently violate basic physical laws, so images generated from these suboptimal layouts show poor quality and low text adherence.