Machine Heart reports
Can AI edit an image exactly as a client asks? With the help of GPT-3 and Stable Diffusion, a model can instantly become a Photoshop expert, editing images at will.
After the rise of diffusion models, many people have focused on how to craft more effective prompts to generate the images they want. Through continuous experimentation with AI drawing models, people have even compiled key phrases that help AI produce high-quality images.
In other words, if you master the correct AI phrasing, the improvement in image quality will be very noticeable (see: “How to Draw ‘Llama Playing Basketball’? Someone Paid $13 to Challenge DALL·E 2 to Show Its True Skills”).
Additionally, some researchers are working in another direction: how to verbally change an image to our desired form.
Recently, we reported on a study from Google Research and other institutions: simply describe how you want an image to change, and the model can largely meet your requirements and generate a photo-realistic result, such as making a dog sit:
The input description for the model is “a sitting dog”, but according to everyday communication habits, the most natural description should be “make this dog sit”. Researchers believe this is an issue that needs optimization, and the model should align better with human language habits.
Recently, a research team from UC Berkeley proposed a new method for editing images based on human instructions called InstructPix2Pix: given an input image and a text description telling the model what to do, the model can follow the description to edit the image.
Paper link: https://arxiv.org/pdf/2211.09800.pdf
For example, to replace a sunflower with a rose in a painting, you simply need to tell the model “replace the sunflower with a rose”:
To obtain training data, the study combined two large pre-trained models, a language model (GPT-3) and a text-to-image generation model (Stable Diffusion), to generate a large paired training dataset of image-editing examples. The researchers trained the new InstructPix2Pix model on this dataset, and at inference time it generalizes to real images and user-written instructions.
InstructPix2Pix is a conditional diffusion model that, given an input image and a text instruction for editing it, generates the edited image. The model performs the edit directly in the forward pass, without needing any additional example images, full descriptions of the input/output images, or per-example fine-tuning, so it can edit an image in just a few seconds.
Although InstructPix2Pix was trained entirely on synthetic examples (i.e., text descriptions generated by GPT-3 and images generated by Stable Diffusion), the model achieved zero-shot generalization to any real image and human-written text. The model supports intuitive image editing, including object replacement, style changes, and more.
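As a concrete illustration, below is a minimal inference sketch assuming the publicly released InstructPix2Pix weights and the Hugging Face diffusers integration; the checkpoint ID, parameter values, and file names are assumptions for illustration, not part of the paper itself.

```python
# Minimal InstructPix2Pix inference sketch (assumes Hugging Face diffusers
# and the publicly released "timbrooks/instruct-pix2pix" checkpoint).
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("painting.jpg").convert("RGB")

# The edit is expressed as an instruction, not a full description of the output image.
edited = pipe(
    "replace the sunflower with a rose",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,  # how strongly to stay faithful to the input image
    guidance_scale=7.5,        # how strongly to follow the text instruction
).images[0]
edited.save("edited.jpg")
```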
Method Overview
The researchers view instruction-based image editing as a supervised learning problem: first, they generated a paired training dataset containing text editing instructions and images before and after editing (Figure 2a-c), and then trained an image editing diffusion model on this generated dataset (Figure 2d). Although generated images and editing instructions were used during training, the model is still able to edit real images using arbitrarily written instructions. Below is an overview of the method in Figure 2.
Generating a Multimodal Training Dataset
During the dataset generation phase, the researchers combined the capabilities of a large language model (GPT-3) and a text-to-image model (Stable Diffusion) to create a multimodal training dataset containing text editing instructions and corresponding images before and after editing. This process includes the following steps:
- Fine-tune GPT-3 to generate a collection of text edits: given a prompt describing an image, generate a text instruction describing the change to be made and a prompt describing the image after the change (Figure 2a);
- Use the text-to-image model to convert the two text prompts (i.e., before editing and after editing) into a pair of corresponding images (Figure 2b).
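Below is a minimal sketch of these two steps, assuming the legacy OpenAI completion API with a hypothetical fine-tuned GPT-3 model name and the standard Stable Diffusion pipeline from diffusers. It is a simplification: fixing the random seed is only a crude stand-in for the consistency techniques the paper uses to keep the image pair aligned.

```python
# Sketch of the two-step training-data generation (Figure 2a-b).
# "ft-gpt3-edit-model" is a hypothetical fine-tuned GPT-3 ID; the sketch
# assumes the fine-tuned model emits the instruction and edited caption
# on two separate lines.
import openai
import torch
from diffusers import StableDiffusionPipeline

def generate_edit_text(input_caption: str) -> tuple[str, str]:
    """Step 1: GPT-3 proposes an edit instruction and the edited caption."""
    completion = openai.Completion.create(
        model="ft-gpt3-edit-model",  # hypothetical fine-tuned model name
        prompt=input_caption,
        max_tokens=64,
    )
    instruction, edited_caption = completion.choices[0].text.strip().split("\n")[:2]
    return instruction.strip(), edited_caption.strip()

# Step 2: render "before" and "after" images from the two captions.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

caption = "photograph of a girl riding a horse"
instruction, edited_caption = generate_edit_text(caption)

# Re-using one seed is a crude stand-in for the consistency tricks in the paper.
generator = torch.Generator("cuda").manual_seed(0)
before = pipe(caption, generator=generator).images[0]
generator = torch.Generator("cuda").manual_seed(0)
after = pipe(edited_caption, generator=generator).images[0]
```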
InstructPix2Pix
The researchers used the generated training data to train a conditional diffusion model based on the Stable Diffusion model, which can edit images based on written instructions.
The diffusion model learns to generate data samples through a sequence of denoising autoencoders that estimate the score of the data distribution (a direction pointing toward high-density data). Latent diffusion improves the efficiency and quality of diffusion models by operating in the latent space of a pre-trained variational autoencoder with encoder E and decoder D.
For an image x, the diffusion process adds noise to the encoded latent z = E(x), producing a noisy latent z_t, where the noise level increases with the timestep t ∈ T. The researchers learn a network ε_θ that predicts the noise added to the noisy latent z_t given the image conditioning c_I and the text instruction conditioning c_T, and they minimize the following latent diffusion objective:

L = E_{E(x), E(c_I), c_T, ε∼N(0,1), t} [ || ε − ε_θ(z_t, t, E(c_I), c_T) ||_2^2 ]
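A minimal PyTorch-style sketch of this objective follows, assuming a latent-space UNet ε_θ that takes the noisy latent concatenated with the encoded conditioning image plus a text embedding; the helper names and tensor shapes are assumptions for illustration only.

```python
# Sketch of the latent diffusion training objective described above.
# `unet`, `vae_encode`, `text_encode`, and `add_noise` are stand-ins for the
# corresponding Stable Diffusion components; shapes are illustrative.
import torch
import torch.nn.functional as F

def training_step(unet, vae_encode, text_encode, add_noise,
                  x, c_image, c_text, num_timesteps=1000):
    # Encode the edited image x and the conditioning image c_I into latent space.
    z = vae_encode(x)              # E(x)
    z_image = vae_encode(c_image)  # E(c_I)

    # Sample a timestep and noise, then produce the noisy latent z_t.
    t = torch.randint(0, num_timesteps, (z.shape[0],), device=z.device)
    eps = torch.randn_like(z)
    z_t = add_noise(z, eps, t)

    # The network sees z_t concatenated with E(c_I), plus the text instruction c_T.
    model_input = torch.cat([z_t, z_image], dim=1)
    eps_pred = unet(model_input, t, text_encode(c_text))

    # L = E[ || eps - eps_theta(z_t, t, E(c_I), c_T) ||_2^2 ]
    return F.mse_loss(eps_pred, eps)
```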
Previous studies (Wang et al.) have shown that for image translation tasks, especially when paired training data is limited, fine-tuning large image diffusion models is superior to training from scratch. Therefore, in this new study, the authors initialized the model weights using a pre-trained Stable Diffusion checkpoint, leveraging its strong text-to-image generation capabilities.
To support image conditioning, the researchers added extra input channels to the first convolutional layer, concatenating z_t and E(c_I). All available weights of the diffusion model were initialized from the pre-trained checkpoint, while the weights operating on the newly added input channels were initialized to zero. The authors reused the same text conditioning mechanism originally intended for captions, taking the text editing instruction c_T as input instead.
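A sketch of how the extra input channels and zero initialization might look on a Stable Diffusion UNet from diffusers is shown below; the channel counts and the conv_in attribute match the standard UNet2DConditionModel, but this is an illustrative assumption rather than the authors' exact code.

```python
# Sketch: widen the UNet's first conv layer from 4 to 8 input channels so it
# can take cat(z_t, E(c_I)); copy pre-trained weights and zero-init the rest.
import torch
import torch.nn as nn
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

old_conv = unet.conv_in  # pre-trained conv expecting 4 latent channels
new_conv = nn.Conv2d(
    in_channels=8,                       # 4 for z_t + 4 for E(c_I)
    out_channels=old_conv.out_channels,
    kernel_size=old_conv.kernel_size,
    padding=old_conv.padding,
)

with torch.no_grad():
    new_conv.weight.zero_()                   # new channels start at zero
    new_conv.weight[:, :4] = old_conv.weight  # keep pre-trained weights
    new_conv.bias.copy_(old_conv.bias)

unet.conv_in = new_conv
```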
Experimental Results
In the following images, the authors showcase the image editing results of their new model. These results are for a set of different real photos and artworks. The new model successfully performed many challenging edits, including object replacement, changing seasons and weather, replacing backgrounds, modifying material properties, converting art media, and more.
The researchers compared the new method with recent techniques such as SDEdit and Text2Live. The new model follows editing instructions directly, whereas other methods (including the baselines) expect a description of the target image or an edit layer; therefore, in the comparison, the authors provided "after-edit" captions instead of editing instructions to those methods. The authors also quantitatively compared the new method with SDEdit using two metrics that measure image consistency and editing quality. Finally, ablation results show how the size and quality of the generated training data affect model performance.