Deep Learning Applications in Art

The Deep Learning Lecture Hall is dedicated to promoting the latest technologies, products, and activities in artificial intelligence and deep learning.

Host Liu Jinyang: First, let’s invite Teacher Wang Naiyan to give us a talk on “Applications of Deep Learning in Art”. In the past few days, we have seen examples related to painting.

Wang Naiyan: Hello everyone! My name is Wang Naiyan, from TuSimple Technology. I am very honored to be here to share some applications of deep learning in the field of art. The art field has always been considered a domain exclusive to humans, and machines have never entered this area before. However, I will discuss some amazing results we can achieve through deep learning.

I will divide this topic into three parts:

The first part is imitation. Given a painting or an image, you can imitate its style.

The second part is abstraction. We can take the abstract concepts that a deep network has learned and display them in a painting.

The third part is creation. Generating an image from scratch without any input.

Before we begin, I would like to briefly introduce deep learning, since it is mentioned so often but rarely explained in simple terms. I will take five minutes to give a brief overview of the technology behind systems like AlphaGo and image recognition applications.

The most commonly used model is the convolutional neural network, abbreviated as CNN. It has two core operations. The first is convolution, which detects features in the image, and these features are actually very simple. In the first layer, for instance, we might detect the edges of a logo or the curves of digits, so each filter is essentially an edge detector. The second operation is downsampling, or pooling, which merges the features in a small region into a summary. By stacking these two operations repeatedly, typically a convolution layer followed by a pooling layer, we build a deep learning model.
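As a rough illustration of the pooling step described above, here is a minimal NumPy sketch. The talk does not say which kind of pooling is meant, so max pooling over non-overlapping 2×2 blocks is assumed; the function name `max_pool2d` and the example values are made up for illustration.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Downsample by keeping the maximum of each non-overlapping size x size block."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # drop rows/columns that do not fill a block
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# A made-up 4x4 feature map; each 2x2 block is summarized by its largest value.
feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 0, 5, 6],
                        [1, 2, 7, 8]])
pooled = max_pool2d(feature_map)
print(pooled)  # 2x2 summary: 4 and 2 in the top row, 2 and 8 in the bottom row
```

Each output value summarizes one small region, so the feature map shrinks while the strongest responses survive.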

How are the features in the first layer detected? In a classic model, we can learn some first-layer filters, which are clearly black-and-white striped patterns used to detect image edges and features. This allows us to identify where the edges are in an image.

Next, I will show an example of an edge filter from the top left to the bottom right. The specific method is to calculate the correlation of this filter at every position in the image. A high correlation indicates the presence of an edge from the top left to the bottom right. We obtain a feature map where bright areas represent significant features, while areas without edges appear gray. Similarly, we can obtain such feature maps for each filter, and by combining them, we complete the crucial convolution operation.
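The filtering step described above can be sketched in a few lines of NumPy. The 3×3 filter values below are hypothetical, chosen only so that a bright line running from top left to bottom right gives a strong response; learned filters in a real network will look different.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide the kernel over the image and record its response at every position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical filter: positive along the top-left to bottom-right diagonal,
# negative elsewhere, and summing to zero so flat regions give no response.
diag_filter = np.array([[ 2., -1., -1.],
                        [-1.,  2., -1.],
                        [-1., -1.,  2.]])

image = np.eye(6)  # toy image containing one bright diagonal line
feature_map = correlate2d_valid(image, diag_filter)
flat_response = correlate2d_valid(np.ones((5, 5)), diag_filter)
```

In `feature_map` the largest values sit where the filter lines up with the diagonal, while `flat_response` is zero everywhere because a uniform region contains no edge.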

Why do we need the convolution operation? In many applications, such as natural language processing, speech, and Go, convolutional neural network models are utilized. The key point is that these applications involve a concept of local similarity. For instance, an entire image may represent a human face, an action, a cat, or a dog, but when looking at a small region, such as a 10×10 area, you may not immediately obtain semantic information; instead, you find similar image patches. Similarly, in sentences, common phrases may occur in various positions, and the same applies to speech and Go. In a small region, such as a 5×5 area, there may be repeating concepts across the entire 19×19 Go board. Therefore, through convolution, we can reduce computational complexity by focusing only on small parts, like a 5×5 area. Additionally, since a 5×5 area can appear in various positions, we can share the 5×5 filter across the entire image. This is the core concept of convolutional neural networks and the reason they are used in many applications.
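To make the savings from weight sharing concrete, here is a small back-of-the-envelope calculation using the 19×19 board and 5×5 region from the talk. It assumes a single input channel, one output value per board position, and no bias terms; the function names are made up for illustration.

```python
def dense_weight_count(height, width):
    """Fully connected: every output position has its own weight for every input position."""
    n = height * width
    return n * n

def conv_weight_count(kernel_size):
    """Convolution: one kernel_size x kernel_size filter is shared by all positions."""
    return kernel_size * kernel_size

dense = dense_weight_count(19, 19)  # 361 * 361 = 130321 weights
conv = conv_weight_count(5)         # 25 weights
print(dense, conv)
```

Under these assumptions the shared filter needs 25 weights where a fully connected layer would need 130,321.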

You can ignore the mathematical formalism; I will explain it in plain language. Here we see a neural network with five layers, and two input images. Researchers have found that the features inside a neural network capture two different things: the content of an image and its form, which we call style. In the lower row of the figure, features from the first, second, third, and fourth layers can each reconstruct the input image almost perfectly. More interestingly, if we design a different kind of feature and feed in a poster image, reconstructing the style from first-layer features gives only fine white, blue, and black dots, while features from higher, more abstract layers reconstruct more of the style. The content of these reconstructions may differ greatly from the original image, but the overall artistic style resembles it. The reconstructions in the lower row, by contrast, stay quite close to the original input image in content.

Can we combine these two advantages? That is essentially the task of this work. In the figure, each row uses a different set of features for the reconstruction, while each column sets a different balance between reconstructing style and reconstructing content. In the image at the bottom right, we can still see the house while the image has taken on the painting style of Deyang. There are more examples where we input the same image but apply different painting styles, generating images that imitate each style, which is quite interesting work.
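The balance between style and content can be sketched as a weighted sum of two losses. The talk does not give the formulas, so this is an assumption: the version below uses channel correlations (Gram matrices) for style and a plain squared difference for content, which is a common formulation in the style transfer literature. The names `alpha` and `beta` and the random stand-in features are made up for illustration.

```python
import numpy as np

def gram_matrix(features):
    """features: (channels, height, width) -> (channels, channels) channel correlations."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (h * w)

def content_loss(gen_feat, content_feat):
    """Penalize differences at each spatial position."""
    return np.mean((gen_feat - content_feat) ** 2)

def style_loss(gen_feat, style_feat):
    """Penalize differences in channel correlations, ignoring where things are."""
    return np.mean((gram_matrix(gen_feat) - gram_matrix(style_feat)) ** 2)

def total_loss(gen_feat, content_feat, style_feat, alpha=1.0, beta=1.0):
    """alpha and beta set the balance between content and style."""
    return alpha * content_loss(gen_feat, content_feat) + beta * style_loss(gen_feat, style_feat)

# Stand-in features: flipping the feature map spatially changes the content
# but leaves the Gram matrix, and therefore the style term, unchanged.
rng = np.random.default_rng(0)
feat = rng.normal(size=(3, 4, 4))
flipped = feat[:, ::-1, ::-1]
print(content_loss(flipped, feat), style_loss(flipped, feat))
```

The flipped example shows why the two terms can be traded off: the style term does not care where a pattern occurs, only which patterns occur together.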

Next, there are even more amazing results that improve on the previous work. For example, I input an image of an old-style car with two curved trunks, and then two modern cars, an SUV and a sports car. The generated results are interesting: they retain the characteristics of the SUV and the sports car, while the trunk shape has changed to match that of the old-style car.

There’s also an example with a building. If I input a building with plain windows together with an image of very ornate windows, this technology can transfer the ornate window style onto the original building image. This is the final result, and it is quite difficult to spot any flaws unless you look closely.

Recently, this work has also been extended to video. If each frame is processed without considering temporal continuity, the visual effect can be very strange. A recent work uses optical flow to constrain the temporal continuity of the generated video frames, which gives quite stable results; please take a look at the video.
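One way to picture an optical-flow constraint is as a penalty on the difference between the current stylized frame and the previous stylized frame moved along the flow. The sketch below is an assumption about the general idea, not the method of the work shown in the talk: it uses whole-pixel flow with nearest-pixel lookup and ignores occlusions, and the function names are made up for illustration.

```python
import numpy as np

def warp(prev_frame, flow):
    """Move prev_frame along the flow: output[y, x] = prev_frame[y - dy, x - dx].

    flow has shape (height, width, 2) holding (dy, dx) per pixel; values are
    rounded to whole pixels and lookups are clamped at the image border.
    """
    h, w = prev_frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    step = np.rint(flow).astype(int)
    src_y = np.clip(ys - step[..., 0], 0, h - 1)
    src_x = np.clip(xs - step[..., 1], 0, w - 1)
    return prev_frame[src_y, src_x]

def temporal_loss(curr_stylized, prev_stylized, flow):
    """Mean squared difference between the current frame and the warped previous frame."""
    return np.mean((curr_stylized - warp(prev_stylized, flow)) ** 2)

# Toy frames: the content moves one pixel to the right between frames.
prev = np.arange(16, dtype=float).reshape(4, 4)
curr = np.roll(prev, 1, axis=1)
curr[:, 0] = prev[:, 0]  # border column stays in place

right_flow = np.zeros((4, 4, 2))
right_flow[..., 1] = 1.0
loss_with_flow = temporal_loss(curr, prev, right_flow)
loss_without_flow = temporal_loss(curr, prev, np.zeros((4, 4, 2)))
print(loss_with_flow, loss_without_flow)
```

With the correct flow the penalty is zero, while ignoring the motion leaves a nonzero penalty, so minimizing such a term pushes consecutive frames to agree along the motion.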

In this part, there is also a very interesting work. So far, we have focused on fully automated synthesis, but artists would like some manual control over the composition of the content. One such work is NeuralDoodle, and I would like to show you a promotional video of it.

The second part is abstraction. Many people think of artificial intelligence, or deep learning, as a black box. Many scientists have noticed this as well, and they use visualization methods to explore what a network has actually learned. In this example, a neural network is trained to recognize a thousand types of objects, and we want to see what each layer contains. I have drawn some results here. The first layer, as mentioned earlier, consists of edge and color detectors, and each small 3×3 grid shows the patterns that activate one neuron. Once first-layer features are combined, more interesting patterns appear: for example, one neuron activates strongly for hair-like features, and others activate for round contours. Moving up, by the third layer parts of objects begin to appear, such as car wheels and features of humans.

Now, if we combine these object parts with local patterns, we can see rich semantic information in the early fourth and fifth layers. There are neurons that activate specifically for flowers and those that activate for dog faces. In fact, the neural network learns semantic information in a bottom-up abstract manner, and it is not a complete black box. In each layer, for every neuron, I can identify the pattern that maximally activates it. If we visualize these patterns, we find they resemble the working patterns studied by neuroscientists in the human visual system, where the visual system activates certain cells for specific objects in a layered structure.
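One simple way to find the patterns that most strongly activate a neuron is to score many image patches with that neuron and keep the top few, for example nine to fill a 3×3 grid. The sketch below assumes a single linear neuron and random stand-in patches; the visualizations shown in the talk may have been produced with a more elaborate method, and all names here are made up for illustration.

```python
import numpy as np

def top_activating_patches(patches, neuron_weights, k=9):
    """Return the indices and responses of the k patches with the strongest response."""
    responses = patches.reshape(len(patches), -1) @ neuron_weights.ravel()
    order = np.argsort(responses)[::-1][:k]
    return order, responses[order]

# Hypothetical neuron that prefers a top-left to bottom-right diagonal.
neuron_weights = np.array([[ 2., -1., -1.],
                           [-1.,  2., -1.],
                           [-1., -1.,  2.]])

rng = np.random.default_rng(0)
patches = rng.normal(size=(50, 3, 3))  # stand-in for patches cut from real images
patches[7] = 10 * neuron_weights       # plant one patch that matches the neuron strongly

top_idx, top_resp = top_activating_patches(patches, neuron_weights)
print(top_idx[0], top_resp[0])
```

The planted patch comes out first, and the remaining eight entries are the random patches that happen to resemble the diagonal pattern most.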

Next, we ask whether different layers really correspond to different levels of abstraction: the first layer to edges and connections, the middle layers to parts, and the high layers to semantic information. Can we display this visually in an image? This is the task of the Deep Dream work. It modifies an image to maximize the activation of a specific neuron. Here, I take a random image of the sky and maximize the activation of a first-layer neuron, which produces texture patterns in the image. At higher levels, we see many tower-like structures, which in effect paints high-level features onto the clouds in the image.
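Modifying an image to maximize an activation is gradient ascent on the image itself. The sketch below is a toy version under strong assumptions: the "layer" is a single random linear map, the image is a flat vector, and the objective is half the squared norm of the layer output. A real network would compute the gradient by backpropagation, and all names here are made up for illustration.

```python
import numpy as np

def objective(weights, image_vec):
    """How strongly the toy layer responds: 0.5 * ||W x||^2."""
    return 0.5 * np.sum((weights @ image_vec) ** 2)

def dream_step(weights, image_vec, step_size=0.1):
    """One gradient-ascent step on the objective with respect to the image."""
    grad = weights.T @ (weights @ image_vec)     # gradient of 0.5 * ||W x||^2
    grad = grad / (np.abs(grad).mean() + 1e-8)   # normalize the step scale
    return image_vec + step_size * grad

rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 16))   # stand-in for one layer of a trained network
image_vec = rng.normal(size=16)      # stand-in for the starting image

before = objective(weights, image_vec)
dreamed = image_vec.copy()
for _ in range(10):
    dreamed = dream_step(weights, dreamed)
after = objective(weights, dreamed)
print(before, after)
```

The weights stay fixed throughout; only the image changes, drifting toward whatever the layer responds to most.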

Of course, we can also directly maximize a specific label, that is, the predicted probability of a certain category. For example, if I have a tree and want it to turn into a building, the tree crown takes on the shape of a house. The leaves can also be turned into a bird with the same method, which is quite interesting. The image below shows the leaves together with a bird’s body and head that were generated automatically. Similarly, an image of a mountain can be transformed into a tower shape. This is, of course, a primitive method. However, if you combine these methods and apply them repeatedly, along with some image processing tricks, you can create dreamlike scenes reminiscent of those in “Inception.” Because many of the thousand categories in the data are dogs, birds, and various buildings, these objects appear throughout the results and create a dream-like effect, hence the name Deep Dream.

This is one of my favorite generated paintings, resembling a rock band where the instrument is a dog’s head, and the scene includes many cars and buildings, with a modern abstract painting style.

Last month, at the end of February, Google held an exhibition in the United States showcasing these abstract paintings, some of which sold for nearly $100,000 to support young artists in their creative endeavors.

Finally, we move to generating an image from scratch. One part is a real image, and the other is generated by a computer algorithm. Can you distinguish which part is real and which is generated? The image on the left is generated, while the one on the right is a real image. The left image has some unrealistic edges that haven’t been processed well.

How is this achieved? Essentially, it involves two convolutional neural networks engaging in a game. One is called the generator network, and the other is the discriminator network. The generator network’s task is to produce an image, while the discriminator network’s job is to determine which image is real and which is generated. Our goal is to make the generated image so realistic that it confuses the discriminator network, making it unable to distinguish between the real and generated images. This is how we achieve the goal of generating realistic images.

In the next example, I first provide a random vector, and I use this generator network to generate an image. The image on the right is real, and I finally ask the discriminator network to identify which image is real and which is fake.
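The game between the two networks can be written as a pair of losses. The talk does not give formulas, so the sketch below assumes the standard cross-entropy formulation from the generative adversarial network literature, with the commonly used non-saturating generator loss; the function names and the example probabilities are made up for illustration.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """d_real / d_fake: the discriminator's probability that real / generated images are real.

    The discriminator wants d_real near 1 and d_fake near 0.
    """
    return -(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

def generator_loss(d_fake):
    """The generator wants the discriminator to rate its images as real (d_fake near 1)."""
    return -np.mean(np.log(d_fake))

# A fully confused discriminator outputs 0.5 for everything.
confused = discriminator_loss(np.full(4, 0.5), np.full(4, 0.5))

# The generator's loss falls as its images fool the discriminator more often.
loss_fooling = generator_loss(np.array([0.9, 0.8]))
loss_caught = generator_loss(np.array([0.1, 0.2]))
print(confused, loss_fooling, loss_caught)
```

Training alternates between lowering the discriminator loss and lowering the generator loss; when the discriminator can do no better than guessing, its loss settles at 2 ln 2.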

Here are some very cool results. These images were generated after training on an indoor image dataset of approximately 300 images. At first glance, nothing seems out of place: the bed in the bedroom, the window, and the decorations on the table are all arranged harmoniously. On closer inspection, you may notice some strange details, such as objects that are hard to identify and odd regions near the edges of the image. In general, though, the network has learned what should be present in a bedroom. Remarkably, I did not tell it that a bedroom should have a window, a bed, or a quilt and pillow; I simply provided it with a number of indoor images, and it generated the output. In a sense, it is also capable of creating something.

There’s also an interesting transformation between different scenes. For example, this is one living room, and this is another scene. We can interpolate between these two scenes, resulting in a fascinating outcome.
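Interpolating between two scenes is usually done on the random input vectors rather than on the images. The sketch below assumes simple linear interpolation between two such vectors; the vector size of 100 and the function name are made up for illustration, and each resulting row would then be passed through a trained generator network, which is not shown here.

```python
import numpy as np

def interpolate(z_start, z_end, steps=5):
    """Return `steps` vectors evenly spaced on the line from z_start to z_end."""
    ts = np.linspace(0.0, 1.0, steps)
    return np.array([(1.0 - t) * z_start + t * z_end for t in ts])

rng = np.random.default_rng(0)
z_living_room = rng.normal(size=100)  # stand-in for the vector behind the first scene
z_other_scene = rng.normal(size=100)  # stand-in for the vector behind the second scene

path = interpolate(z_living_room, z_other_scene, steps=5)
print(path.shape)
```

The first and last rows reproduce the two original scenes, and the rows in between give the gradual transition.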

Some very interesting results also come from arithmetic on images: take a photo of a smiling woman, subtract a photo of a neutral woman, add a photo of a neutral man, and the result is a photo of a smiling man. The arithmetic is carried out on the vectors that generate these images rather than on the pixels themselves, which shows that the generation process carries semantic information.
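The arithmetic can be illustrated with toy vectors. In the two-dimensional codes below, the first axis stands for "smiling" and the second for "woman versus man"; these axes and values are invented purely for illustration, since a real generator uses much longer vectors whose directions are learned and not labeled.

```python
import numpy as np

# Made-up 2-D codes: axis 0 is "smile" (0 = neutral, 1 = smiling),
# axis 1 is "woman (+1) versus man (-1)".
smiling_woman = np.array([1.0, 1.0])
neutral_woman = np.array([0.0, 1.0])
neutral_man = np.array([0.0, -1.0])

# Subtracting the neutral woman isolates the "smile" direction;
# adding the neutral man applies that direction to a man.
smiling_man = smiling_woman - neutral_woman + neutral_man
print(smiling_man)  # smile axis is 1, woman/man axis is -1
```

Feeding the resulting vector to the generator is what produces the image of a smiling man.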

Finally, I would like to share some discussions and my personal views with everyone.

First, what is creativity? Personally, I believe creativity is about combining ordinary things to generate something unusual yet reasonable. What is the strength of algorithms? They excel at identifying common patterns, such as edges and parts of objects. However, if you want to combine these common patterns into meaningful yet uncommon combinations, this is something I believe computers are not good at; humans should be responsible for this work, while algorithms are meant to reduce repetitive labor for humans.

Second, regarding the question of small data versus big data. A common viewpoint is that, for instance, if you show a person one or two pictures of cats, they can learn the concept, but machines may need thousands or tens of thousands of images to learn effectively. My viewpoint is the opposite: humans are the big data learners. Humans learn every day, starting from the first second they open their eyes, receiving signals from the external world. This is why humans possess a strong ability to generalize, allowing them to transfer knowledge from one task to another. In neural network training, two important concepts are initialization and transfer learning. Humans can recognize what a cat is from one or two pictures because they have previously seen 100,000 pictures of related concepts, such as dogs and rabbits, which they have already learned. This allows them to infer that a cat is a four-legged animal running on the ground. Therefore, in machine learning and artificial intelligence, there is research on how to transfer learned concepts to new tasks.

Lastly, do algorithms truly understand art? This question is somewhat meaningless, akin to asking whether submarines can swim. I believe it depends on how you define understanding. If you simply want to generate a painting that resembles an artist’s work, that can certainly be achieved. If you want to generate a painting with a cheerful or melancholic atmosphere, that may not yet be possible. Still, as I mentioned earlier, it is not out of reach if we can transfer knowledge from other tasks.

Finally, machine learning and deep learning have already become tools for artists. In the spring of 2016, New York University launched a course recognizing that this could represent a significant asset.

Thank you all for listening.

This article belongs to the original work of the “Deep Learning Lecture Hall”. If you need to reprint it, please contact loveholicguoguo.

Author Profile

Wang Naiyan

Currently the Chief Scientist at Beijing TuSimple Technology Co., Ltd., responsible for algorithm development. Previously, he graduated in 2015 from the Hong Kong University of Science and Technology with a degree in Computer Science and Engineering. His main research focuses on computer vision and data mining, particularly applying statistical computational models to practical problems in these areas. His research emphasizes image classification, object tracking, and recommendation systems.
