Understanding Text-to-Image Models in AI Art

Introduction

AI art generation has started to enter the public eye. In the past year, a large number of text-to-image models have emerged, especially with the advent of Stable Diffusion and Midjourney, sparking a wave of AI art creation. Many artists have also begun to experiment with AI to assist in their artistic endeavors. This article will systematically review the text-to-image algorithms that have appeared in recent years, helping readers gain a deeper understanding of the principles behind them.

To explore the intersection of complex science and humanistic art, the Collective Intelligence Club is hosting the “Complex Science and Art” seminar series, gathering actors and thinkers from various fields—including scientists, artists, scholars, and related practitioners—to engage in interdisciplinary discussions and collaborative outputs. The seminar series began in July 2022 and will run monthly for a total of twelve sessions. AI-generated art is one of the themes of the seminar. Friends interested in this topic are welcome to sign up. Details and registration links can be found at the end of this article.

Hu Pengbo | Author

Zhu Sijia | Typesetting

Shisanwei | Proofreading

Table of Contents

1. Based on VQ-VAE
   • AE
   • VAE
   • VQ-VAE
   • DALL-E
2. Based on GAN
   • VQGAN
   • VQGAN-CLIP
   • DALL-E Mini
   • Parti
   • NUWA-Infinity
3. Based on Diffusion Model
   • Diffusion Model
   • GLIDE
   • DALL-E2
   • Imagen
   • Stable Diffusion
4. Model Trials
5. Summary

Based on VQ-VAE

    Before understanding the principles of VQ-VAE, it is essential to first understand AE (AutoEncoder) and VAE (Variational Autoencoders), as these models belong to self-supervised learning methods. Next, this article will provide a brief introduction to them.

    AE

Autoencoders consist of an encoder and a decoder (as shown in the figure below)[1]. The encoder compresses the image into a low-dimensional representation, and the decoder reconstructs the image from that compressed representation. In practical applications, autoencoders are often used for dimensionality reduction, denoising, anomaly detection, or neural style transfer.

[Figure: autoencoder architecture (encoder and decoder)]
Since the goal of the autoencoder is to reconstruct its input, its loss function is $L_{AE} = \|x - \hat{x}\|^2$, where $x$ is the input and $\hat{x}$ is the reconstruction of $x$; simple end-to-end training suffices to obtain an autoencoder.
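For illustration, here is a minimal autoencoder sketch in PyTorch; the layer sizes and input dimensions are illustrative assumptions rather than settings from any particular paper.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Minimal fully connected autoencoder: compress, then reconstruct."""
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x):
        z = self.encoder(x)           # compressed intermediate representation
        return self.decoder(z)        # reconstruction of x

model = AutoEncoder()
x = torch.rand(16, 784)               # a toy batch of flattened images
loss = nn.functional.mse_loss(model(x), x)   # L_AE = ||x - x_hat||^2
loss.backward()
```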
    VAE
Unlike AE, VAE does not learn a fixed, deterministic intermediate representation; instead, it directly learns a distribution and then samples from this distribution to obtain the intermediate representation used to reconstruct the original image[2].
[Figure: VAE architecture]
VAE assumes that the intermediate representation follows a normal distribution, so the encoder maps the original image $x$ to a normal distribution $\mathcal{N}(\mu, \sigma^2)$, and through the reparameterization trick the sampled intermediate representation is obtained as $z = \mu + \sigma \odot \epsilon$, where $\epsilon$ is sampled from a standard normal distribution. The decoder then decodes the intermediate representation $z$ to obtain the reconstruction $\hat{x}$ of the original image. The loss function of VAE is defined as:

$$L_{VAE} = \|x - \hat{x}\|^2 + D_{KL}\big(\mathcal{N}(\mu, \sigma^2)\,\|\,\mathcal{N}(0, \mathbf{I})\big)$$

In the loss function of VAE, the first term lets the model reconstruct the input, while the second term pushes the distribution output by the encoder as close as possible to the standard normal distribution. The benefit is that $q(z \mid x)$ is forced close to the standard normal distribution, so that during generation one can sample directly from the normal distribution and then generate images through the decoder.
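A minimal sketch of the reparameterization trick and the VAE loss, assuming a small fully connected encoder and decoder (all architectural details here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.enc = nn.Linear(input_dim, 256)
        self.mu_head = nn.Linear(256, latent_dim)      # predicts mu
        self.logvar_head = nn.Linear(256, latent_dim)  # predicts log(sigma^2)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, input_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu_head(h), self.logvar_head(h)
        eps = torch.randn_like(mu)                     # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps         # reparameterization: z = mu + sigma * eps
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="sum")      # reconstruction term
    # KL( N(mu, sigma^2) || N(0, I) ) in closed form
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```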
    VQ-VAE
    The primary issue with VAE is that it uses a fixed prior (normal distribution) and employs continuous intermediate representations, which can lead to poor diversity and controllability in image generation. To address this, VQ-VAE (Vector Quantized Variational Autoencoder) opts for discrete intermediate representations, and typically uses an autoregressive model to learn the prior (such as PixelCNN or Transformer). In VQ-VAE, the intermediate representation is stable and diverse enough to significantly influence the output of the decoder, aiding in the generation of rich and varied images. Consequently, many subsequent text-to-image models are based on VQ-VAE[3].
[Figure: VQ-VAE architecture]
    The algorithmic flow of VQ-VAE is as follows:
    1. First, set K vectors as a queryable Codebook.
    2. The input image is passed through the encoder CNN to obtain N intermediate representations, and then, using the nearest neighbor algorithm, query the vectors in the Codebook that are most similar to these N intermediate representations.
3. Place the queried similar vectors from the Codebook in the corresponding positions to obtain the quantized intermediate representation $z_q$.
4. The decoder reconstructs the image from the obtained intermediate representation $z_q$.
    The core part of VQ-VAE is the Codebook query operation. By using a highly consistent Codebook to replace chaotic intermediate representations, it can effectively enhance the controllability and richness of image generation. The loss function of VQ-VAE is defined as:
$$L = \log p\big(x \mid z_q(x)\big) + \big\|\text{sg}[z_e(x)] - e\big\|_2^2 + \beta\,\big\|z_e(x) - \text{sg}[e]\big\|_2^2$$
    Where, sg is the gradient stopping operation, meaning that the module where sg is located will not undergo gradient updates.
The first term of the loss function trains the encoder and decoder. Because the codebook query operation in the middle is discrete and non-differentiable, the gradient of $z_q$ is copied directly onto $z_e$, so that gradients can be backpropagated through the encoder. The second term, known as the VQ loss, trains the codebook module $e$: since $z_e(x)$ is held fixed by the stop-gradient, it forces the codebook vectors $e$ to move toward the encoder outputs. The third term, the commitment loss, only lets $z_e(x)$ receive gradients; its purpose is to pull $z_e(x)$ toward the codebook vectors, thereby stabilizing the output of the encoder module.
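The codebook lookup and the straight-through gradient copy can be sketched as follows; this is a simplified illustration of the quantization step, not the original authors' implementation, and the codebook size and β value are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with straight-through gradients."""
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)  # the K codebook vectors e
        self.beta = beta

    def forward(self, z_e):                                # z_e: (N, code_dim) encoder outputs
        # L2 distance from every encoder output to every codebook vector
        dist = torch.cdist(z_e, self.codebook.weight)      # (N, num_codes)
        idx = dist.argmin(dim=1)                           # nearest-neighbour indices
        z_q = self.codebook(idx)                           # quantized representation

        vq_loss = F.mse_loss(z_q, z_e.detach())            # ||sg[z_e] - e||^2 : moves the codebook
        commit_loss = F.mse_loss(z_e, z_q.detach())        # ||z_e - sg[e]||^2 : commits the encoder
        loss = vq_loss + self.beta * commit_loss

        # straight-through estimator: copy gradients from z_q back to z_e
        z_q = z_e + (z_q - z_e).detach()
        return z_q, loss
```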
In VAE, since the intermediate representation follows a standard normal distribution, random sampling from the standard normal distribution suffices for generation. In VQ-VAE, however, randomly selecting N codebook vectors cannot guarantee that the expected image is generated, so a model is needed to learn how to generate the specific intermediate representations that produce valid images (also known as learning the prior).
    Therefore, in the original VQ-VAE paper, the authors used PixelCNN to learn the prior. First, they used the trained VQ-VAE to obtain the intermediate discrete encodings of the training data to serve as the corpus for training the autoregressive model PixelCNN. Subsequently, during generation, PixelCNN is directly used to generate an intermediate discrete representation, which is then matched with the Codebook to generate images using the Decoder.
    DALL-E
    DALL-E, developed by OpenAI, uses VQ-VAE in its first generation and is one of the most popular text-to-image models currently. The first generation of DALL-E is not publicly available, so those wishing to try it can directly use the DALL-E-Mini version recreated by netizens.
    The most notable feature of the first generation of DALL-E is its excellent understanding of semantics and its ability to generate various unconventional yet semantically coherent images[4,5].
    The generation module in the DALL-E model uses VQ-VAE; however, its prior learning uses text-to-intermediate discrete representation mapping. The specific steps are as follows:
    1. Train a dVAE (referred to as dVAE in the article, which is actually a VQ-VAE, and will not be discussed further here), where the number of Codebook entries is 8192.
    2. Train an autoregressive model, here using a Transformer, to predict the intermediate representation from the input text.
During the generation process, the text is input directly, the Transformer predicts the intermediate representation $z_q$, and the dVAE's decoder module then generates the final image from it. In the DALL-E paper, the authors also describe many technical details, such as using the CLIP model to select the images with the highest similarity to the text when making the final selection, as well as distributed training and mixed-precision training; for specifics, please refer to the original paper.
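The generation pipeline can be summarized in the following pseudocode-style sketch; `transformer`, `dvae_decoder`, and `clip` stand for already-trained models, and their method names (`sample`, `similarity`) are placeholders for illustration, not any real API:

```python
def generate(text, transformer, dvae_decoder, clip, num_candidates=8):
    """DALL-E-style inference sketch: text -> image tokens -> image, then CLIP reranking."""
    candidates = []
    for _ in range(num_candidates):
        # autoregressively sample discrete image tokens conditioned on the text
        image_tokens = transformer.sample(text)              # placeholder sampling call
        image = dvae_decoder(image_tokens)                   # dVAE decoder maps tokens back to pixels
        candidates.append(image)
    # keep the candidate whose CLIP image embedding best matches the text embedding
    scores = [float(clip.similarity(text, img)) for img in candidates]  # placeholder scoring call
    return candidates[scores.index(max(scores))]
```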

    Based on GAN

    Generative Adversarial Networks (GAN) consist of two main modules: the generator and the discriminator. The generator is responsible for generating an image, while the discriminator assesses the quality of this image, determining whether it is a real sample or a generated fake sample. Through iterative processes, the generator can produce increasingly realistic images, while the discriminator becomes more precise in judging the authenticity of images. The greatest advantage of GAN is its independence from prior assumptions, learning the data distribution through iterative processes [6].
The original objective of GAN is defined as:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$

where $G$ is the generator, $D$ is the discriminator, $p_{\text{data}}$ is the real data distribution, and $p_z$ is the noise prior. When we fix $G$ and maximize $V(D, G)$, the discriminator output $D(x)$ needs to approach 1 when the data comes from the real distribution $p_{\text{data}}$ and approach 0 when the data comes from the generator $G(z)$; that is, the discriminator must judge real data as 1 and generated data as 0, and at this point we can optimize the discriminator $D$. When we fix the discriminator $D$ and minimize $V(D, G)$, the generator $G$ is required to produce data that closely resembles real data.
In simple terms, the training process of a GAN is as follows (a minimal training-loop sketch is given after the list):
1. Initialize a generator $G$ and a discriminator $D$.
2. Fix the parameters of the generator $G$ and update only the parameters of the discriminator $D$. Specifically, select a batch of real samples and generate a batch of samples with the generator, feed both into the discriminator $D$, which must judge which samples are real and which are generated, and optimize the discriminator based on its error against the true labels.
3. Fix the parameters of the discriminator $D$ and update only the parameters of the generator $G$. Specifically, use the generator to produce a batch of samples, feed the generated samples into the discriminator $D$ for judgment, and optimize the generator's parameters so that the discriminator tends to judge them as real samples.
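A minimal sketch of this alternating optimization in PyTorch, assuming a generator G fed with random noise and a discriminator D that outputs a probability of being real (shapes and hyperparameters are illustrative):

```python
import torch
import torch.nn.functional as F

def gan_training_step(G, D, opt_G, opt_D, real_images, latent_dim=128):
    batch = real_images.size(0)

    # --- Step 2: update the discriminator D with G fixed ---
    fake_images = G(torch.randn(batch, latent_dim)).detach()   # detach so G gets no gradient
    d_loss = F.binary_cross_entropy(D(real_images), torch.ones(batch, 1)) + \
             F.binary_cross_entropy(D(fake_images), torch.zeros(batch, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # --- Step 3: update the generator G with D fixed ---
    fake_images = G(torch.randn(batch, latent_dim))
    g_loss = F.binary_cross_entropy(D(fake_images), torch.ones(batch, 1))  # try to be judged "real"
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```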
    The GAN model introduced above is merely the original GAN; over time, GAN has been applied to various fields, leading to the emergence of many variants. Next, we will introduce a well-known text-to-image model based on GAN: VQGAN-CLIP.
    VQGAN
    Having already discussed the basic principles of GAN, VQGAN (Vector Quantized Generative Adversarial Networks) is a variant of GAN (as shown in the figure below)[7], inspired by VQ-VAE, utilizing a codebook to learn discrete representations.
Specifically, a predefined set of K vectors serves as a discrete feature lookup table (the Codebook). When an image is passed through the CNN encoder, N intermediate representations $\hat{z}$ are obtained; the most similar vectors are then queried from the Codebook, yielding the quantized representation $z_q$, which can be described by the formula:

$$z_q = \mathbf{q}(\hat{z}) := \Big(\arg\min_{z_k \in \mathcal{Z}} \|\hat{z}_{ij} - z_k\|\Big)_{ij}$$
    Subsequently, the CNN Decoder reconstructs the image based on the obtained representations.
[Figure: VQGAN architecture]
    The above steps are very similar to VQ-VAE; however, VQGAN differs in that these steps correspond solely to the generator in GAN, thus requiring a discriminator to assess the generated images. Unlike traditional GAN, the discriminator here does not judge each image but rather assesses each image’s Patch.
    For the training of the generator in VQGAN, its loss function is very similar to that of VQ-VAE, defined as:
$$L_{VQ}(E, G, \mathcal{Z}) = \|x - \hat{x}\|^2 + \big\|\text{sg}[E(x)] - z_q\big\|_2^2 + \beta\,\big\|\text{sg}[z_q] - E(x)\big\|_2^2$$
The adversarial (GAN) part of VQGAN's training objective is defined as:

$$L_{GAN}\big(\{E, G, \mathcal{Z}\}, D\big) = \log D(x) + \log\big(1 - D(\hat{x})\big)$$
    Combining the above loss functions for training the generator, the complete formula can be expressed as:
$$\mathcal{Q}^{*} = \arg\min_{E, G, \mathcal{Z}}\ \max_{D}\ \mathbb{E}_{x \sim p(x)}\Big[L_{VQ}(E, G, \mathcal{Z}) + \lambda\, L_{GAN}\big(\{E, G, \mathcal{Z}\}, D\big)\Big]$$
    In reality, the loss function remains quite consistent with that of GAN; the difference is that the optimization for the generator portion must employ the same methods as VQ-VAE.
After training VQGAN, one could in principle generate an image by directly initializing a $z_q$. However, to obtain a reasonable $z_q$, a model is needed to learn the prior. Here a Transformer is used to model the sequence of discrete codes in $z_q$, which can simply be treated as an autoregressive model. Thus, given an initial random token, the Transformer can generate a complete $z_q$, which the CNN decoder module then uses to generate the final image.
    VQGAN-CLIP
    VQGAN-CLIP is also a very popular text-to-image model, and some open text-to-image platforms utilize VQGAN-CLIP[8].
    VQGAN-CLIP guides the VQGAN model through textual description information, ultimately generating images that closely resemble the text description. The specific process is illustrated in the figure below:
[Figure: VQGAN-CLIP generation pipeline]
Specifically, an image must first be initialized. It can be randomly generated pixels, in which case the model iterates from scratch, or a pre-drawn original image, in which case the iteration is akin to repainting that image. The VQGAN encoder module then produces the intermediate discrete Z-vector representation, the same as in VQGAN.
    Using the CLIP model, the similarity between the features of the generated image and the specified text is compared, adjusting the intermediate representation vector Z-vector, thereby allowing the VQGAN module to generate images consistent with the text description. As shown in the figure, in addition to the VQGAN and CLIP modules, there are also Random Crops and Augmented Images; these operations are intended to enhance image stability, and experiments have shown that adding these two operations is beneficial for optimization.
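The core idea, optimizing the latent Z-vector so that the decoded image's CLIP embedding matches the text embedding, can be sketched as follows; `vqgan_decode`, `clip_encode_image`, `clip_encode_text`, and `augment` are placeholders for the corresponding pretrained modules and the augmentation pipeline:

```python
import torch

def vqgan_clip_generate(text, z, vqgan_decode, clip_encode_image, clip_encode_text,
                        augment, steps=300, lr=0.1):
    """Iteratively adjust the latent z so the decoded image matches the text under CLIP."""
    z = z.clone().requires_grad_(True)            # random init or the encoding of an init image
    text_feat = clip_encode_text(text).detach()
    opt = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        image = vqgan_decode(z)                   # VQGAN decoder renders the current latent
        crops = augment(image)                    # random crops / augmentations for stability
        image_feat = clip_encode_image(crops)
        # maximize cosine similarity between image and text features
        loss = -torch.cosine_similarity(image_feat, text_feat, dim=-1).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return vqgan_decode(z).detach()
```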
Below are images generated with VQGAN-CLIP, which can produce high-quality results from complex descriptions.
[Figure: images generated by VQGAN-CLIP]
    DALL-E Mini
    DALL-E Mini is a version recreated by netizens based on DALL-E[8]. Unlike the original DALL-E, DALL-E Mini does not use the original VQ-VAE, but rather utilizes VQGAN. The DALL-E Mini model is significantly smaller than the original DALL-E model, and the training samples used are also relatively fewer[9].
    DALL-E Mini first employs the BART model (a Sequence-to-Sequence model) to learn the mapping from text to image, converting text into discrete image representations.
[Figure: DALL-E Mini training pipeline]
    In the image generation step, text can be directly input into BART to obtain image discrete representations, followed by using the VQ-GAN Decoder module to decode the discrete image representations into complete images, and finally using CLIP to filter the images, resulting in the final generated output.
[Figure: DALL-E Mini inference pipeline]
    Parti
    Shortly after Imagen was released (for an introduction to Imagen, see the diffusion model section), Google proposed a new text-to-image model called Parti, which stands for “Pathways Autoregressive Text-to-Image.” Intuitively, it uses Google’s newly proposed Pathway language model[10].
    Unlike Imagen, Parti returns to the original approach of text-to-image generation, not directly using text representations as conditions for the diffusion model paradigm to generate images, but instead using the Pathway language model to learn the mapping from text representations to image representations, similar to DALL-E2, learning a prior model. Additionally, Parti employs a VQGAN-based approach rather than a diffusion model.
    Specifically, Parti first trains a ViT-VQGAN model, and then uses the Pathway language model to learn the mapping from text to image tokens. Due to the powerful sequence prediction capability of the Pathway language model, the output image representations are excellent. During the prediction process, one only needs to map the text to image representations and then use the ViT-VQGAN decoder module for decoding.
[Figure: Parti architecture]
The standout feature of Parti is its 20B-parameter model, which supports complex semantic understanding. The following figure illustrates the generation results of models of different sizes for the text description: “A portrait photo of a kangaroo wearing an orange hoodie and blue sunglasses standing on the grass in front of the Sydney Opera House, holding a sign that says ‘Welcome Friends’!”
[Figure: Parti generations at different model sizes for the kangaroo prompt]
    NUWA-Infinity
    NUWA-Infinity is an infinite visual generation model developed by the NUWA team at Microsoft Research Asia, based on previous work. Its notable feature is that it can continue existing paintings, especially achieving stunning effects for landscape paintings. Additionally, this model supports text-to-image generation, animation generation, and other tasks, but since its primary innovation lies in image continuation, only this feature will be detailed here[11].
To achieve this functionality, the authors proposed a global autoregressive model nested with a local autoregressive generation mechanism: the global autoregressive model captures the dependencies between visual patches (patch-level), while the local autoregressive model captures the dependencies between the visual tokens (token-level) within each patch. This can be expressed mathematically as:

$$p(x) = \prod_{i=1}^{n} p\big(p_i \mid p_{<i}\big) = \prod_{i=1}^{n} \prod_{j=1}^{m} p\big(t_{i,j} \mid p_{<i},\, t_{i,<j}\big)$$

where $p_i$ denotes the $i$-th patch and $t_{i,j}$ its $j$-th visual token. In essence, it is a global autoregressive model over the n patches with a local autoregressive model over the m tokens of each patch nested inside.
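Conceptually, this nested factorization corresponds to two nested sampling loops, sketched below; `model.predict_token` is a hypothetical stand-in for the patch-level and token-level autoregressive model:

```python
def nested_autoregressive_generate(model, n_patches, m_tokens, text):
    """Global loop over patches, local loop over the visual tokens inside each patch."""
    patches = []                                  # previously generated patches (patch-level context)
    for i in range(n_patches):                    # global autoregression over patches
        tokens = []                               # tokens of the current patch (token-level context)
        for j in range(m_tokens):                 # local autoregression over tokens
            # p(t_{i,j} | previous patches, previous tokens of this patch, text)
            tokens.append(model.predict_token(patches, tokens, text))
        patches.append(tokens)
    return patches                                # each patch's tokens are later decoded by the VQGAN decoder
```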
    In NUWA-Infinity, the authors also proposed two mechanisms: Nearby Context Pool (NCP) and Arbitrary Direction Controller (ADC). The ADC is responsible for segmenting the image into patches and determining the direction of the patches, as illustrated below, where the left image defines the order during training, and the right image defines the order during inference.
[Figure: ADC patch generation order during training (left) and inference (right)]
    As the image size increases, the number of patches may exceed the maximum length that the autoregressive model can handle. Therefore, a mechanism for adding new patches and removing old patches is necessary to ensure that the autoregressive model continuously performs sequence learning in the vicinity of the patches that need to be generated.
[Figure: Nearby Context Pool (NCP) mechanism]
During the training process of the model, the image is first divided into patches, and a random patch generation order is selected, corresponding to the global autoregressive operation. For each patch, its neighboring patches, together with position encoding and text information, are fed into the autoregressive model to obtain the predicted intermediate discrete representation. Meanwhile, the trained VQGAN is used to produce the ground-truth intermediate discrete representations of the patches, and the model's goal is to make the predicted representation sufficiently close to the VQGAN-produced one. Intuitively, the model learns to predict the current patch's intermediate discrete representation from the intermediate discrete representations of its neighboring patches and the text representation.
    In inference tasks for image continuation, one simply inputs the image into the model, selects K patches as conditions, initializes the NCP, and then predicts the next patch using the selected patches combined with text information. Finally, the VQGAN Decoder is used to decode the predicted patch’s intermediate discrete representation into an image. Through continuous iterations, the image continuation functionality is ultimately achieved.

    Based on Diffusion Model

    Unlike VQ-VAE and VQ-GAN, diffusion models are currently the core method in the field of text-to-image generation. The most well-known and popular text-to-image models, such as Stable Diffusion, Disco-Diffusion, Mid-Journey, and DALL-E2, are all based on diffusion models. This section will provide a detailed introduction to the principles of diffusion models and the algorithms based on them.

    Diffusion Model
    Recall that VQ-VAE and VQ-GAN both first map images to intermediate latent variables through an encoder, and then the decoder reconstructs the images from these intermediate latent variables. In fact, diffusion models essentially do the same thing, but they employ a completely new approach to achieve this goal[12,13,14].
    Diffusion models consist of two main processes: the forward diffusion process and the reverse denoising process. The forward diffusion process primarily transforms an image into random noise, while the reverse denoising process restores a randomly noisy image back into a complete image.
[Figure: the forward diffusion process and the reverse denoising process]
    To aid understanding, a classic diffusion model will be selected for introduction. For detailed derivations of diffusion models, please refer to[13,14].
    Forward Diffusion Process
The essence of the forward diffusion process is to randomly add noise to the original image. Through T iterations, the distribution of the original image is ultimately transformed into a standard Gaussian distribution. Specifically, given the initial data distribution $x_0 \sim q(x)$, the process of adding noise can be defined by the following formula:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big), \qquad q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$

where $\beta_t \in (0, 1)$ is a predefined noise schedule; as $t$ increases, the data distribution of $x_T$ approaches an isotropic Gaussian distribution.

It is worth noting that, since the noise schedule of the forward diffusion process is predefined, the forward process has no parameters to learn, and the result at any step can be computed directly. Defining $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, we have:

$$x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_{t-1} = \cdots = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$$

Thus, the distribution of the forward diffusion process at step $t$ given $x_0$ is obtained as:

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big)$$
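The closed-form expression means a noisy sample at any step t can be drawn in a single operation, as in this sketch (the linear beta schedule and tensor shapes are illustrative assumptions):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # predefined noise schedule beta_t
alphas = 1.0 - betas                              # alpha_t = 1 - beta_t
alphas_bar = torch.cumprod(alphas, dim=0)         # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)       # broadcast over image tensors (B, C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```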
    Reverse Diffusion Process
The reverse process is the restoration process: recovering the original data distribution from Gaussian noise. In practice, it suffices to learn the distribution $q(x_{t-1} \mid x_t)$, which is approximated with a learnable neural network, defined as follows:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$$

Since $q(x_{t-1} \mid x_t)$ cannot be estimated directly, a neural network model is generally used to approximate it. Note that in the original paper the variance is not trained but fixed in advance, $\Sigma_\theta(x_t, t) = \sigma_t^2 \mathbf{I}$ with $\sigma_t^2 = \beta_t$. Meanwhile, because the forward process is Markovian, the posterior conditioned on $x_0$ has a tractable closed form, which enables the subsequent optimization. It can be written as:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t \mathbf{I}\big)$$

Through derivation, it can be obtained that (for details, see [13,14]):

$$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_t\Big), \qquad \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$

Diffusion models essentially learn the data distribution, so their (negative) log likelihood can be bounded as:

$$-\log p_\theta(x_0) \le \mathbb{E}_{q}\Big[\log \frac{q(x_{1:T}\mid x_0)}{p_\theta(x_{0:T})}\Big] =: L_{VLB}$$

Finally, the loss can be expressed as [13,14]:

$$L_{VLB} = \mathbb{E}_q\Big[\underbrace{D_{KL}\big(q(x_T \mid x_0)\,\|\,p_\theta(x_T)\big)}_{L_T} + \sum_{t=2}^{T}\underbrace{D_{KL}\big(q(x_{t-1}\mid x_t, x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\big)}_{L_{t-1}} \underbrace{-\,\log p_\theta(x_0 \mid x_1)}_{L_0}\Big]$$

After simplification, the final loss expression is obtained; in this form, the goal is simply to predict the noise added at each step:

$$L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big)\big\|^2\Big]$$
    Training Process
Intuitively, diffusion models use a neural network $\epsilon_\theta$ to predict the noise added at each step of the diffusion process. The algorithmic process is as follows:
[Figure: DDPM training algorithm (left) and sampling algorithm (right)]
    After training is complete, sampling can be conducted using the reparameterization trick, with the specific process illustrated in the right diagram above. By continuously “subtracting” the noise predicted by the model, a complete image can gradually be generated.
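Both procedures can be sketched as follows, reusing T, betas, alphas, alphas_bar, and q_sample from the previous sketch; eps_model stands for any noise-prediction network such as a U-Net, and the fixed variance σ_t² = β_t follows the simple choice mentioned above:

```python
import torch
import torch.nn.functional as F

def train_step(eps_model, x0):
    """Training: predict the noise added at a randomly chosen step t (L_simple)."""
    t = torch.randint(0, T, (x0.size(0),))
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    return F.mse_loss(eps_model(x_t, t), noise)

@torch.no_grad()
def sample(eps_model, shape):
    """Sampling: start from pure noise and repeatedly 'subtract' the predicted noise."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps = eps_model(x, torch.full((shape[0],), t))
        coef = betas[t] / (1.0 - alphas_bar[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt() + betas[t].sqrt() * z   # sigma_t^2 = beta_t
    return x
```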
    Classifier-Free Guidance Diffusion
    Based on traditional diffusion models, subsequent improvements have made diffusion models widely applicable to text-to-image generation tasks. Among these, the most commonly used improved version is Classifier-Free Guidance Diffusion[15].
In the diffusion model above, noise is estimated with $\epsilon_\theta(x_t, t)$; a guided diffusion model additionally feeds the guiding condition $y$ into the model, giving $\epsilon_\theta(x_t, t, y)$. Classifier-Free Guidance Diffusion combines the conditional and unconditional noise estimates, defined as:

$$\tilde{\epsilon}_\theta(x_t, t, y) = (1 + w)\,\epsilon_\theta(x_t, t, y) - w\,\epsilon_\theta(x_t, t)$$

where $w$ is the guidance weight. The advantage of this approach is that training is very stable and it is free from the constraints of a separate classifier (it is essentially equivalent to learning an implicit classifier). The downside is the higher cost: each denoising step requires two noise estimates. Nevertheless, most well-known text-to-image models are based on this method.
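The combination of the two noise estimates at sampling time can be sketched as follows; eps_model is assumed to accept an optional condition (None meaning unconditional), and w is the guidance weight:

```python
import torch

def classifier_free_guidance_eps(eps_model, x_t, t, cond, w=3.0):
    """Combine conditional and unconditional noise estimates:
    eps_tilde = (1 + w) * eps(x_t, cond) - w * eps(x_t)."""
    eps_cond = eps_model(x_t, t, cond)       # conditional estimate (e.g. text-conditioned)
    eps_uncond = eps_model(x_t, t, None)     # unconditional estimate (condition dropped)
    return (1.0 + w) * eps_cond - w * eps_uncond
```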
    GLIDE
    GLIDE uses text as a condition to implement a text-guided diffusion model. In terms of text guidance, the article primarily employs two strategies: Classifier-Free Diffusion Guidance and CLIP as conditional supervision, while also utilizing a larger model, with data volume comparable to DALL-E[16].
The core of GLIDE is Classifier-Free Diffusion Guidance, which uses the text description as the guidance condition when training the diffusion model, defined as:

$$\hat{\epsilon}_\theta(x_t \mid y) = \epsilon_\theta(x_t \mid \emptyset) + s \cdot \big(\epsilon_\theta(x_t \mid y) - \epsilon_\theta(x_t \mid \emptyset)\big)$$

where $y$ is the text description and $s$ is the guidance scale.
    Since the GLIDE method was proposed relatively early, its performance is not as strong as many existing methods. Below is an example of an image generated by GLIDE.
[Figure: example images generated by GLIDE]
    GLIDE also supports image editing operations via region selection + text prompt, and the results are also quite good. During the process, one simply needs to mask the area to be obscured and input the remaining image into the network to produce the completed image.
[Figure: GLIDE text-guided image editing (inpainting) examples]
    Additionally, GLIDE’s semantic understanding capability is not very strong, making it challenging to produce logically coherent images under some rare text descriptions, whereas DALL-E2 excels in this aspect.
    DALL-E2
    DALL-E2 is OpenAI’s latest AI image generation model, notable for its impressive understanding and creativity. It has approximately 3.5B parameters, and compared to the previous generation, DALL-E2 can generate images at four times the resolution while closely adhering to semantic information. The authors employed a manual evaluation method, asking volunteers to review 1000 images, with 71.7% believing that it matched the text description better, and 88.8% finding the images more appealing than the previous generation[17,18].
    DALL-E2 consists of three modules:
    • CLIP model, aligning image and text representations
    • Prior model, receiving text information and converting it into CLIP image representations
    • Diffusion model, receiving image representations to generate complete images
[Figure: DALL-E2 architecture (CLIP, prior, diffusion decoder)]
    The training process of DALL-E2 is as follows:
    • Train a CLIP model to align text and image features.
    • Train a prior model, either an autoregressive model or a diffusion prior model (experiments have shown that the diffusion prior model performs better), which functions to map text representations to image representations.
    • Train a diffusion decoder model, which aims to restore the original image based on image representations.
    Once training is complete, the inference process is quite straightforward: first, the CLIP text encoder is used to obtain the text encoding, then the prior model maps the text encoding to image encoding, and finally, the diffusion decoder generates the complete image using the image encoding. Note that the diffusion decoder model uses a modified GLIDE diffusion model, generating images at a size of 64×64, followed by using two upsampling diffusion models to upscale to 256×256 and 1024×1024.
    The original DALL-E2 paper also mentions several shortcomings, such as the tendency to confuse objects and attributes, and the inability to accurately place text within images. However, these do not dampen enthusiasm for text-to-image generation, and DALL-E2 is widely applied in various artistic creation processes.
    Imagen
Shortly after DALL-E2 was proposed, Google introduced a new text-to-image model called Imagen[19]. The paper states that, compared with DALL-E2, it generates more photorealistic images and exhibits stronger language understanding (using a new evaluation method called DrawBench).
    The image generation process of Imagen is very similar to that of DALL-E2. First, the text is encoded into a representation, and then the diffusion model maps this representation into a complete image, also employing two diffusion models to further enhance resolution. Unlike DALL-E2, Imagen directly encodes text information using the T5-XXL model and then uses a conditional diffusion model to generate images directly from the text encoding. Thus, Imagen does not require learning a prior model.
[Figure: Imagen architecture]
    By directly using the T5-XXL model, its semantic knowledge is far richer than that of CLIP (the number of image-text matching datasets is significantly less than that of purely text datasets), thus Imagen performs better in terms of semantic fidelity compared to DALL-E2. Additionally, the authors found that increasing the language model size can effectively enhance the semantic fidelity of samples.
    Stable Diffusion
    Stable Diffusion is a text-to-image generation model recently released by Stability.ai. Its simple interaction and fast generation speed have significantly lowered the barrier to use, while still maintaining impressive generation effects, thus sparking a wave of AI creation[20].
    © The images in this article were generated by the author using Stable Diffusion.
    Stable Diffusion is an improvement upon the previous Latent Diffusion model. The diffusion model mentioned above has the characteristic of a slow reverse denoising process, which occurs in pixel space, causing significant delays as the image resolution increases. The Latent Diffusion model, however, considers conducting the diffusion process in a lower-dimensional latent space, significantly reducing training and inference costs.
    Stable Diffusion consists of three parts:
    1. VAE

    Its role is to convert images into low-dimensional representations, allowing the diffusion process to occur in this low-dimensional representation. After diffusion is complete, the VAE decoder is used to decode it back into an image.

    2. U-Net Network

    U-Net is the backbone network of the diffusion model, responsible for predicting noise to achieve the reverse denoising process.

    3. Text Encoder CLIP

    Primarily responsible for converting text into representations that U-Net can understand, guiding U-Net during diffusion.

    The specific inference process of Stable Diffusion is illustrated in the figure below[19]. First, CLIP is used to convert the text into a representation, then guiding the diffusion model U-Net to perform the diffusion process in the low-dimensional representation (64×64). Finally, the diffused low-dimensional representation is fed into the VAE decoder to generate images.
[Figure: Stable Diffusion inference pipeline]
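At a conceptual level, the three-stage inference can be sketched as follows; text_encoder, unet, vae_decoder, and scheduler stand for the pretrained CLIP text encoder, U-Net, VAE decoder, and a sampling scheduler, and their call signatures are assumptions for illustration, not Stable Diffusion's actual API:

```python
import torch

@torch.no_grad()
def latent_diffusion_generate(prompt, text_encoder, unet, vae_decoder, scheduler,
                              steps=50, guidance=7.5, latent_shape=(1, 4, 64, 64)):
    """Conceptual latent-diffusion inference: denoise in latent space, then decode with the VAE."""
    cond = text_encoder(prompt)                   # text representation that guides the U-Net
    uncond = text_encoder("")                     # empty prompt for classifier-free guidance
    latents = torch.randn(latent_shape)           # diffusion runs in the low-dimensional latent space

    for t in scheduler.timesteps(steps):          # placeholder: decreasing sequence of timesteps
        eps_c = unet(latents, t, cond)            # conditional noise estimate
        eps_u = unet(latents, t, uncond)          # unconditional noise estimate
        eps = eps_u + guidance * (eps_c - eps_u)  # classifier-free guidance
        latents = scheduler.step(latents, eps, t) # placeholder: one reverse-diffusion update
    return vae_decoder(latents)                   # VAE decoder maps latents back to pixel space
```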

    Model Trials

    Having understood the algorithmic principles behind text-to-image generation, one can also try out some open-source models. Here are some currently popular and easy-to-use model links, among which Stable Diffusion and MidJourney offer the best effects and most convenient interaction.
    VQGAN-CLIP

    https://nightcafe.studio/

    DALL-E-Mini

    https://huggingface.co/spaces/dalle-mini/dalle-mini

    DALL-E2

https://github.com/openai/dall-e (requires waiting list)

    Stable Diffusion

    https://beta.dreamstudio.ai/dream

    Disco-Diffusion

    https://colab.research.google.com/github/alembics/disco-diffusion/blob/main/Disco_Diffusion.ipynb

    MidJourney

    https://www.midjourney.com/home/

    NUWA

    https://nuwa-infinity.microsoft.com/#/ (not yet open, but stay tuned)

    Summary

The existing text-to-image generation models are primarily based on three foundational algorithms: VQ-GAN, VQ-VAE, and diffusion models. Because diffusion models can generate rich, diverse, and high-quality images, they have become the core method in the field of text-to-image generation. Currently, the main limitation on the widespread use of diffusion models is their slow generation speed, since each generation requires many iterative denoising steps. However, with the emergence of new techniques, such as Stable Diffusion's use of Latent Diffusion, the generation time of diffusion models has gradually decreased. It is reasonable to believe that diffusion models will bring about a new transformation in the field of AI art generation.
    References

    [1] An Introduction to Autoencoders

    [2] https://towardsdatascience.com/understanding-variational-autoencoders-vaes-f70510919f73

    [3] Neural Discrete Representation Learning

    [4] https://openai.com/blog/dall-e/

    [5] Zero-Shot Text-to-Image Generation

    [6] Generative adversarial nets

    [7] Taming Transformers for High-Resolution Image Synthesis

    [8] VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance

[9] https://wandb.ai/dalle-mini/dalle-mini/reports/DALL-E-mini--Vmlldzo4NjIxODA

    [10] Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    [11] NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

    [12] Denoising Diffusion Probabilistic Models

    [13] https://lilianweng.github.io/posts/2021-07-11-diffusion-models/#nice

    [14] https://huggingface.co/blog/annotated-diffusion

    [15] Classifier-Free Diffusion Guidance

    [16] GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    [17] Hierarchical Text-Conditional Image Generation with CLIP Latents

    [18] https://openai.com/dall-e-2/

    [19] Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    [20] https://github.com/sd-webui/stable-diffusion-webui

    [21] https://huggingface.co/blog/stable_diffusion


    Registration Now Open for the Complex Science x Art Seminar Series

    Since the latter half of the 20th century, the “way of thinking” inspired by complexity research has rapidly spread to various fields of cognitive activity. Chaos, self-organization, criticality, self-creation, emergence… the richness of its conceptual layers provides us with flexible tools for studying the world. In this sense, we have reason to view complexity theory as an important topic that expands the intersection between art and science. One fundamental way in which art responds to complexity is by creating systems that exhibit “emergent behavior.” Ontologically, we no longer view artworks as static objects but rather as instances of an evolving creative process. At the same time, emerging complex sciences also provide contemporary artists with an open toolbox, including chaos, fractals, cellular automata, genetic algorithms, ant colony algorithms, artificial neural networks, L-System, and artificial life, further promoting the development of digital aesthetics, bio-art, and AI art. Complexity science not only helps us gain a deeper understanding of the generative mechanisms of consciousness and life systems but also stimulates researchers and practitioners across disciplines to collaboratively explore the potential of post-human creativity and new aesthetics, aiming to open up a more integrated creative space.
    The “Complex Science and Art” seminar, jointly initiated by the president of the Institute of Consciousness and Universe Research, popular science writer Shisanwei, art critic Wang Yanran, and curator Long Xingru, aims to gather actors and thinkers from various fields—including scientists, artists, scholars, and related practitioners—to engage in interdisciplinary knowledge discussions that transcend single disciplines, exploring the potential intersection of complexity research and humanistic art. This seminar series began in July 2022 and will be held monthly for a total of twelve sessions. Friends interested in this topic are welcome to sign up. You can join the community and access video replays.

    Seminar details and framework:

    Chaos & Muses: Complex Science x Art Seminar Series

    Recommended Reading

    • A Comprehensive Review of 100 Papers on Computational Aesthetics: How to Conduct Aesthetics from the Perspective of Complexity Science
    • Frontiers of Computational Aesthetics: Rediscovering the History of Landscape Painting through Information Theory
    • Truth and Beauty in Physics and Biology
    • The Complete Online Launch of “Zhangjiang: 27 Lectures on the Frontiers of Complexity Science”!