Sun Yat-sen University Open Source Diffusion Model Framework

Machine Heart Column
Machine Heart Editorial Department
In recent years, image generation models based on Diffusion Models have emerged one after another, showcasing stunning generation results. However, the existing research codebases are highly fragmented and lack a unified framework, which makes implementations "hard to migrate," "high-barrier," and "low-quality."
To address this, the HCP Lab at Sun Yat-sen University has constructed the HCP-Diffusion framework, systematically implementing related algorithms based on Diffusion models for model fine-tuning, personalized training, inference optimization, image editing, and more, as shown in Figure 1.


Figure 1 HCP-Diffusion framework structure diagram: existing diffusion-related methods are brought under one unified framework, with a variety of modular training and inference optimizations.
HCP-Diffusion uses uniformly formatted configuration files to coordinate various components and algorithms, significantly improving the framework’s flexibility and scalability. Developers can combine algorithms like building blocks without the need to repeatedly implement code details.
For example, based on HCP-Diffusion, we can deploy and combine various common algorithms such as LoRA, DreamArtist, and ControlNet simply by modifying the configuration files. This not only lowers the barriers to innovation but also allows the framework to accommodate various customized designs.
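To illustrate the building-block idea, the sketch below stacks two algorithms purely by editing a configuration structure. The keys (`model`, `plugins`, `type`, `rank`) are invented for this example and are not HCP-Diffusion's actual schema.

```python
# Hypothetical sketch of composing algorithms through configuration only;
# the keys below are illustrative, not HCP-Diffusion's real config schema.
base_config = {
    "model": {"pretrained": "runwayml/stable-diffusion-v1-5"},
    "plugins": [],
}

def with_plugin(config, plugin):
    """Return a copy of the config with one more plugin enabled."""
    return {**config, "plugins": config["plugins"] + [plugin]}

# Stack LoRA and ControlNet "like building blocks" without touching code.
cfg = with_plugin(base_config, {"type": "lora", "rank": 8})
cfg = with_plugin(cfg, {"type": "controlnet", "condition": "canny"})
print([p["type"] for p in cfg["plugins"]])  # -> ['lora', 'controlnet']
```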
HCP-Diffusion: Functional Module Introduction
Framework Features
HCP-Diffusion modularizes the mainstream diffusion training algorithms into a single general-purpose framework. Its main features are as follows:
  • Unified Architecture: Building a unified code framework for Diffusion series models.
  • Operator Plugins: Supporting operator-level algorithms for data, training, inference, and performance optimization, such as DeepSpeed, Colossal-AI, and offload acceleration.
  • One-Click Configuration: The Diffusion series models can be implemented by flexibly modifying configuration files.
  • One-Click Training: Providing a Web UI for one-click training and inference.
Data Module
HCP-Diffusion supports defining multiple parallel datasets, each with its own image size and annotation format. Every training iteration draws one batch from each dataset, as shown in Figure 2. In addition, each dataset can be backed by multiple data sources, supporting txt, json, yaml, and custom annotation formats, with a highly flexible data preprocessing and loading mechanism.
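The one-batch-per-dataset-per-iteration behavior can be sketched in plain Python; this is an illustrative stand-in, not the framework's actual data loader:

```python
from itertools import cycle

# Sketch: each dataset keeps its own batch size, and every training
# step draws one batch from every dataset; smaller datasets loop.
def batches(data, batch_size):
    it = cycle(data)  # repeat smaller datasets indefinitely
    while True:
        yield [next(it) for _ in range(batch_size)]

datasets = {
    "photos": batches(["p1", "p2", "p3"], batch_size=2),
    "sketches": batches(["s1", "s2"], batch_size=1),
}

for step in range(2):  # two training iterations
    step_batches = {name: next(gen) for name, gen in datasets.items()}
    print(step, step_batches)
```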


Figure 2 Dataset Structure Diagram
The dataset-processing stage provides aspect-ratio bucketing with automatic clustering, so datasets with varying image sizes can be handled directly. Users need no extra resizing or alignment of the data; the framework automatically selects the best grouping by aspect ratio or resolution. This significantly lowers the barrier to data preparation, improves the user experience, and lets developers focus on the innovation of the algorithms themselves.
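A minimal sketch of aspect-ratio bucketing: here images are grouped by rounding their aspect ratio to fixed-width buckets, whereas the framework clusters buckets automatically; the bucketing rule below is an assumption for illustration.

```python
# Group images with similar aspect ratios so each batch can share one
# crop size. Simplified: round the ratio to a fixed grid of buckets.
def bucket_key(width, height, step=0.25):
    ratio = width / height
    return round(ratio / step) * step

images = [(512, 512), (768, 512), (512, 768), (500, 500)]
buckets = {}
for w, h in images:
    buckets.setdefault(bucket_key(w, h), []).append((w, h))
print(buckets)
```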
For image preprocessing, the framework is also compatible with various image-processing libraries such as torchvision and albumentations. Users can configure preprocessing methods directly in the configuration files as needed, or extend them with custom image-processing methods.
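The idea of config-driven preprocessing can be sketched as a name-to-callable registry; the transform names and toy pixel-list operations below are invented stand-ins for real torchvision/albumentations ops:

```python
# Sketch: resolve preprocessing names from a config file to callables,
# so transforms are chosen declaratively. Names here are illustrative.
TRANSFORMS = {
    "invert": lambda px: [255 - v for v in px],
    "halve_contrast": lambda px: [v // 2 for v in px],
}

def build_pipeline(names):
    ops = [TRANSFORMS[n] for n in names]
    def run(pixels):
        for op in ops:
            pixels = op(pixels)
        return pixels
    return run

pipe = build_pipeline(["invert", "halve_contrast"])
print(pipe([0, 255]))  # -> [127, 0]
```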


Figure 3 Example of Dataset Configuration File
For text annotation, HCP-Diffusion defines a flexible and clear prompt-template specification that supports complex and diverse training methods and data annotations. Through the word_names field in the source section of the dataset configuration above, users can map the special tokens in curly braces (see the figure below) to embedding word vectors and category descriptions, making the templates compatible with methods like DreamBooth and DreamArtist.


Figure 4 Prompt Template
Moreover, for text annotations, it also provides text augmentation methods such as tag dropout (TagDropout) and tag shuffling (TagShuffle), which reduce overfitting between images and their text annotations and make the generated images more diverse.
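Assuming captions are comma-separated tag lists (a common convention in diffusion fine-tuning), the two augmentations might look like the following sketch; it is not the framework's actual implementation:

```python
import random

# Sketch of the two text augmentations: randomly drop tags, or shuffle
# their order, so the model does not memorize fixed caption patterns.
def tag_dropout(tags, p, rng):
    kept = [t for t in tags if rng.random() >= p]
    return kept or tags[:1]  # never drop every tag

def tag_shuffle(tags, rng):
    tags = list(tags)
    rng.shuffle(tags)
    return tags

rng = random.Random(0)
tags = ["1girl", "smile", "outdoors", "blue sky"]
print(tag_dropout(tags, 0.5, rng))
print(tag_shuffle(tags, rng))
```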
Model Framework Modules
HCP-Diffusion decomposes the model side into reusable modules. Specifically, the Image Encoder and Image Decoder handle image encoding and decoding, the Noise Generator produces noise for the forward process, the Diffusion Model implements the diffusion process itself, the Condition Encoder encodes the generation conditions, the Adapter fine-tunes the model to align it with downstream tasks, and the positive/negative dual channel controls generation with positive and negative conditions.
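The positive/negative dual channel corresponds to combining two conditional predictions in the style of classifier-free guidance; a pure-Python stand-in, with arbitrary example values:

```python
# Sketch of the positive/negative dual channel: the model is evaluated
# once under the positive condition and once under the negative one,
# and the two predictions are blended classifier-free-guidance style.
def guided_prediction(pred_neg, pred_pos, guidance_scale):
    return [n + guidance_scale * (p - n) for n, p in zip(pred_neg, pred_pos)]

# Toy 2-element "noise predictions" instead of real tensors.
noise = guided_prediction([0.2, 0.4], [0.6, 0.0], guidance_scale=7.5)
print(noise)
```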


Figure 5 Example of Model Structure Configuration (Model Plugins, Custom Words, etc.)
As shown in Figure 5, HCP-Diffusion can realize mainstream training algorithms like LoRA, ControlNet, and DreamArtist through simple combinations in the configuration files. It also supports combining these algorithms, for example training LoRA and Textual Inversion together, or binding exclusive trigger words to a LoRA. Furthermore, the plugin interface makes it easy to define custom plugins, and it already covers the mainstream methods. Through this modularization, HCP-Diffusion expresses the mainstream algorithms within one framework, lowering the development threshold and promoting collaborative innovation.
HCP-Diffusion abstracts Adapter-style algorithms such as LoRA and ControlNet into model plugins. By defining common base classes for model plugins, all such algorithms can be treated uniformly, reducing both usage and development costs.
The framework provides four plugin base classes that together cover the mainstream algorithms:
  • SinglePluginBlock: a single-layer plugin that modifies a layer's output based on that layer's input, e.g. the LoRA family. Insertion layers can be selected with regular expressions (re: prefix); the pre_hook: prefix is not supported.
  • PluginBlock: has exactly one input layer and one output layer, e.g. for defining a residual connection. Insertion layers can be selected with regular expressions (re: prefix), and both the input and output layers support the pre_hook: prefix.
  • MultiPluginBlock: may have multiple input and output layers, e.g. ControlNet. Regular expressions (re: prefix) are not supported, but both input and output layers support the pre_hook: prefix.
  • WrapPluginBlock: replaces a layer of the original model, holding the original layer as a member of the plugin object. Replacement layers can be selected with regular expressions (re: prefix); the pre_hook: prefix is not supported.
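To make the plugin idea concrete, here is a SinglePluginBlock-style sketch in plain Python: a LoRA-like wrapper that adds a scaled delta, computed from the layer's input, to a frozen layer's output. The class name, the 0.5× delta, and the list-based "layers" are all illustrative, not the framework's API:

```python
# Sketch of a single-layer plugin: change one layer's output using that
# layer's own input (the SinglePluginBlock pattern, LoRA-style).
class LoraLikePlugin:
    def __init__(self, layer, scale):
        self.layer, self.scale = layer, scale
        # Stand-in for the trainable low-rank path B @ A @ x.
        self.delta = lambda x: [v * 0.5 for v in x]

    def __call__(self, x):
        base = self.layer(x)  # frozen pretrained path
        return [b + self.scale * d for b, d in zip(base, self.delta(x))]

frozen_layer = lambda x: [2.0 * v for v in x]  # pretrained weight, frozen
patched = LoraLikePlugin(frozen_layer, scale=1.0)
print(patched([2.0, 4.0]))  # -> [5.0, 10.0]
```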
Training and Inference Modules


Figure 6 Custom Optimizer Configuration
The configuration files in HCP-Diffusion support defining Python objects, which are automatically instantiated at runtime. This design allows developers to easily integrate any custom modules that can be installed via pip, such as custom optimizers, loss functions, noise samplers, etc., without modifying the framework code, as shown in the above figure. The configuration file structure is clear, easy to understand, and highly reproducible, helping to smoothly connect academic research and engineering deployment.
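The object-from-config mechanism is similar in spirit to Hydra-style instantiation; a minimal sketch, where the `_target_` key name is an assumption for illustration and `collections.Counter` stands in for a custom optimizer or loss:

```python
import importlib

# Sketch: a config entry names a Python class by dotted path plus its
# kwargs, and the framework instantiates it at runtime.
def instantiate(spec):
    spec = dict(spec)  # don't mutate the caller's config
    module_path, _, cls_name = spec.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_path), cls_name)
    return cls(**spec)

# Any pip-installable class works; Counter is just a stdlib stand-in.
counter = instantiate({"_target_": "collections.Counter", "a": 2, "b": 1})
print(counter["a"])  # -> 2
```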
Acceleration Optimization Support
HCP-Diffusion supports training optimization frameworks such as Accelerate, DeepSpeed, and Colossal-AI, which can significantly reduce memory usage and accelerate training. It supports EMA operations, which can further improve the model's generation quality and generalization. In the inference phase, it supports model offload and VAE tiling, so image generation can complete with as little as 1 GB of memory.
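VAE tiling works by decoding a large latent in overlapping tiles so that peak memory stays bounded. A 1-D sketch of the tile layout (real tiling is 2-D, with blending at the overlaps); the function and its parameters are illustrative:

```python
# Sketch: split a length-`length` axis into overlapping tiles of size
# `tile`, so each tile can be decoded separately with low peak memory.
def tile_spans(length, tile, overlap):
    step = tile - overlap
    spans, start = [], 0
    while start + tile < length:
        spans.append((start, start + tile))
        start += step
    spans.append((max(length - tile, 0), length))  # final tile hugs the end
    return spans

print(tile_spans(10, tile=4, overlap=1))  # -> [(0, 4), (3, 7), (6, 10)]
```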


Figure 7 Modular Configuration File
With the simple configuration shown above, users can set up a model without hunting through scattered framework resources. HCP-Diffusion's modular design fully separates model definition, training logic, and inference logic, letting users focus on the methods themselves. The framework also ships configuration examples for most mainstream algorithms; deployment only requires changing a few parameters.
HCP-Diffusion: Web UI
In addition to editing configuration files directly, HCP-Diffusion provides a graphical Web UI with modules for image generation, model training, and more. It improves the user experience, significantly lowers the framework's learning threshold, and accelerates the path from algorithm theory to practice.


Figure 8 HCP-Diffusion Web UI

Laboratory Introduction

The HCP Lab at Sun Yat-sen University was founded by Professor Liang Lin in 2010. In recent years, it has produced substantial academic results in multimodal content understanding, causal and cognitive reasoning, and embodied learning, and has received multiple domestic and international science and technology awards as well as best paper awards. The lab is committed to building product-grade AI technologies and platforms. Laboratory website: http://www.sysu-hcp.net
