Fine-Tuning Stable Diffusion to Create Pokémon Worlds

New Intelligence Report

Source: AI Technology Review

Editor: Peach

[New Intelligence Guide] No prompt library needed: any text you enter produces a Pokémon.

Stable Diffusion, the recently popular text-to-image model, is powerful, open, and simple enough to build on, which has opened up creative possibilities well beyond plain text-to-image generation.

Recently, Justin Pinkney, a machine learning researcher from Lambda Labs, fine-tuned this model to build a Pokémon generator!

Let’s take a look at some interesting examples.

The image below shows Pokémon generated from the following names: Girl with a Pearl Earring, Obama, Trump, Boris Johnson, Totoro, Hello Kitty.

Lady Gaga, Boris Johnson, Putin, Merkel, Trump, Plato:

Jesus Christ:

Besides existing characters and public figures, you can also enter a description to generate a Pokémon of your own imagining: Skeleton Priest

You can also enter your own name or username to generate your own Pokémon image. Twitter users have been doing exactly that to see what they would look like as Pokémon.

Caption: User Jo Barf Creepy’s Pokémon image

Caption: User Elizabeth Holmes’s Pokémon image

Caption: User Upbeatblue’s Pokémon image

Caption: User Onion-sama’s Pokémon image

Entering the names of comic characters also yields matching Pokémon:

The Pokémon many people grew up with get new looks in this generator as well: Pikachu, Bulbasaur, Charizard, Treecko, Lucario, Mew.

How the Pokémon Generator Was Built

Pinkney showcased the training process of this Pokémon generator on Twitter.

Link: https://github.com/LambdaLabsML/examples/tree/main/stable-diffusion-finetuning

He stated that Stable Diffusion is a great general-purpose model, but getting output in a specific style is not easy. It usually takes a lot of tedious work to build complex libraries of text prompts. The shortcut is to fine-tune the image generation model itself.

Pinkney fine-tuned the original Stable Diffusion on a dataset of Pokémon images.

First, a dataset was built. It contained Pokémon images paired with text descriptions. For example, Bulbasaur was described as “an image of a green Pokémon with red eyes,” while Caterpie was described as “a green-yellow toy with a red nose.”

Caption: Pokémon dataset

Of course, these descriptions were not written by hand. They were generated by a neural network, the image-captioning model BLIP. The captions are not perfect, but they are good enough.

Then he trained the model for only a few hours on an A6000. The model learned to generate images in the Pokémon style while retaining its prior knowledge for a while, before eventually overfitting to the dataset.

Early in training, the samples looked like normal images. They then gradually took on the Pokémon style, and as training continued, the model eventually produced Pokémon images that no longer matched the original prompt:

This is a very simple fine-tuning, but it works extremely well. With the fine-tuned model, whatever prompt you give it, it generates a Pokémon, so there is no need to painstakingly craft prompts anymore.

When creating Pokémon, you can choose to generate several at once:

Caption: Mechanical cat with wings

Pinkney said everyone is welcome to apply this model to new domains and in more sophisticated ways. Small tools like this show the benefits of open-source AI models such as Stable Diffusion.

One more thing

After the model sparked a wave of creations online, Pinkney published a blog post with some additional details about the work.

He found it surprising that the model retained some general knowledge from the original Stable Diffusion, even though it had been trained for a few thousand steps on a limited dataset. During fine-tuning on Pokémon, the model starts to overfit quickly. If you sample from it naively, it generates nonsensical Pokémon for new prompts, meaning it has catastrophically forgotten its original training data. Stable Diffusion, however, maintains an exponential moving average (EMA) version of the model during training, and this is the version typically used for inference.

Therefore, when using the EMA weights, we are effectively using an average of the original model and the fine-tuned model. This turns out to be essential for generating Pokémon. You can also tune the effect by averaging the weights of the new model with those of the initial model, which controls how strongly the Pokémon style comes through. Fine-tuning followed by weight averaging can effectively mix the original content with the fine-tuned style.
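The two kinds of averaging mentioned here can be sketched in a few lines. This is a minimal illustration rather than Pinkney's actual code: the function names `ema_update` and `interpolate`, the `decay` default, and the `alpha` parameter are assumptions made for the example. Weights are represented as a plain dict from parameter name to value, so the same arithmetic would apply to floats, NumPy arrays, or tensors.

```python
def ema_update(ema, current, decay=0.999):
    """One EMA step: ema <- decay * ema + (1 - decay) * current.

    `ema` and `current` map parameter names to values.
    The decay value is an illustrative assumption.
    """
    return {k: decay * ema[k] + (1.0 - decay) * current[k] for k in ema}


def interpolate(base, tuned, alpha):
    """Linear blend of two sets of weights.

    alpha = 0.0 returns the base (original) weights,
    alpha = 1.0 returns the fine-tuned weights.
    """
    if base.keys() != tuned.keys():
        raise ValueError("both models must have the same parameter names")
    return {k: (1.0 - alpha) * base[k] + alpha * tuned[k] for k in base}


# Toy usage with a single scalar "weight" per model.
original = {"w": 0.0}
fine_tuned = {"w": 2.0}
blended = interpolate(original, fine_tuned, alpha=0.25)
```

In this sketch a small `alpha` keeps the result close to the original model and a large `alpha` moves it toward the fine-tuned one, which is the knob the paragraph above describes.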

Caption: Left, the fully fine-tuned model; right, the model with only the attention layers fine-tuned.

You can also freeze different parts of the model during fine-tuning. The image above compares two such approaches: the model with only the attention layers fine-tuned generates a more normal-looking Yoda but is not as good at creating Pokémon.
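Freezing part of a model usually comes down to switching off gradients for the parameters you want to keep fixed. The helper below is a hypothetical sketch, not code from the linked repository: the name `freeze_except` and the `"attn"` name filter are assumptions, and real parameter names depend on the model implementation. It works on any iterable of `(name, parameter)` pairs whose parameters expose a `requires_grad` attribute, which is what `named_parameters()` yields for a PyTorch module.

```python
def freeze_except(named_params, keyword="attn"):
    """Leave only parameters whose name contains `keyword` trainable.

    `named_params` is an iterable of (name, param) pairs; each param
    must have a writable `requires_grad` attribute. The "attn"
    substring is an assumed naming convention for attention layers.
    Returns the names of the parameters left trainable.
    """
    trainable = []
    for name, param in named_params:
        param.requires_grad = keyword in name
        if param.requires_grad:
            trainable.append(name)
    return trainable


# With a PyTorch model this would be called as:
#   trainable_names = freeze_except(model.named_parameters())
# and only the still-trainable parameters passed to the optimizer.
```

Checking the returned list of names before training is a cheap way to confirm that the filter matched the layers you intended.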

References:

https://www.justinpinkney.com/pokemon-generator/

This article is reproduced with permission from WeChat public account “AI Technology Review” (ID: aitechtalk)
