Understanding Deep Neural Network Design Principles


Lei Feng Network's note: Artificial intelligence will eliminate many jobs in the future while also creating many. Like machinery and computers, it is an extension of human capability; anyone should be able to use artificial intelligence just as they use machinery, and it should not be the monopoly of large companies.

Deep learning can be said to be the core of various artificial intelligence technologies today. Due to concerns about being replaced by artificial intelligence, more and more people want to get started with deep learning.

However, as Ian Goodfellow mentioned, many tutorials on the market are just a long list of algorithms and do not focus on how to use these algorithms to solve current problems. These tutorials can easily mislead people into thinking that becoming an expert in machine learning means knowing all algorithms.

In fact, deep learning is still rapidly developing; today’s technology may be obsolete tomorrow. Moreover, there are numerous new network structures proposed daily. For those of us who cannot learn all technologies, we need to identify the commonalities among various network structures and attempt to classify and understand the design principles behind them.

This session of Lei Feng Network's Hard Innovation Open Class invites Yu Jianguo, author of the book "Super Intelligent Agents", to explain why "deep" networks outperform "shallow" ones and what tasks deep learning is suited for. He also attempts to find the shared principles behind techniques such as feedforward neural networks, recurrent neural networks, convolutional neural networks, residual networks, pre-training, multi-task learning, end-to-end learning, autoencoders, transfer learning, distillation, dropout, regularization, and batch normalization, so that everyone can get a feel for how to design deep neural networks suited to their own tasks. This provides a different perspective on deep learning.

Content Introduction

This open class includes but is not limited to the following content:

  • Understanding what learning is

  • Why deep learning is efficient

  • The design principles of neural networks

  • Materials needed for beginners

Guest Introduction


Yu Jianguo is a Ph.D. student at the Human Interface Laboratory of Aizu University. His master's research used deep learning to integrate lip-shape data into the training of speech recognition models, improving recognition rates without requiring lip-shape data at recognition time. He continued into doctoral studies out of his love for research, and hopes to share the thoughts and self-learning experiences he has accumulated over these years. You can find his shared content by searching for YJango on Zhihu, or check out his continuously serialized book "Super Intelligent Agents" on Gitbook, which discusses how machine learning and human-brain learning work.

(The complete video of this open class is 82 minutes long)

The following is a refined compilation of the guest’s sharing content by Lei Feng Network (WeChat public account: Lei Feng Network). Follow the WeChat public account under Lei Feng Network and reply “PPT” to obtain the complete PPT from the guest.

Hello everyone, I am Yu Jianguo, a first-year Ph.D. student at Aizu University in Japan, and my master’s research was based on deep learning combined with lip shape for speech recognition. I am very happy to share with you here. Without further ado, let’s get straight to the point.

Due to the popularity of artificial intelligence, more and more people are worried that their jobs will be replaced, and therefore want to get started with deep learning, only to find it very “black box”. This time, I will share some of my personal insights regarding the design principles of deep neural networks.

Intelligence: What is Learning

There are too many things that intelligence can do, making it difficult to provide a convincing definition in one sentence.

So let’s approach it from another angle: Why do living beings need intelligence, and what is its function?

The answer is singular: to survive.

The Survival Game

Now, let’s assume that the universe has not yet produced life, and we imagine a survival game.


This survival game is similar to the mini-game in the upper right corner of the slide: avoid danger and survive. It is important that you clear all prior knowledge from your mind. You do not know how to play this game; all the rules are random, and it is not a given that touching a green pillar leads to death. A little abstraction gives us the model on the left. A, B, C, and D represent different environments; in response to environmental stimuli, individuals produce behaviors of approaching or distancing, which result in death or survival. Here the environment is the input x, the individual is the association f (which can also be called a function or mapping; I will call it an association from here on), and the behavior is the output y.

The game rules are:

  • 1. Assume that environments B and C are dangerous; approaching them leads to death. (Which environments are dangerous is random; it could just as well have been A and D.)

  • 2. At the same time, the individual’s association f is randomly generated by nature.

If you were the creator of nature, how would you design a life form that can self-sustain under the above conditions?

The direct brute-force method is to keep generating individuals randomly; eventually one will meet the conditions. For example, we can use the mapping on the slide, where 01 represents B, 10 represents C, 1 represents distancing, and 0 represents approaching: whenever B or C appears, we want the individual's f to output 1.
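To make the brute-force search concrete, here is a minimal, hypothetical Python sketch (the bit encoding and all names are illustrative, not from the talk):

```python
import random

# Environments encoded as bit strings: "01" = B, "10" = C, etc.
# An association f maps an environment to a behavior: 1 = distance, 0 = approach.
ENVIRONMENTS = ["00", "01", "10", "11"]  # A, B, C, D
DANGEROUS = {"01", "10"}                 # B and C are deadly in this round

def random_f():
    """Nature randomly generates an association from environment to behavior."""
    return {env: random.randint(0, 1) for env in ENVIRONMENTS}

def survives(f):
    """An individual survives only if it distances itself (1) from every danger."""
    return all(f[env] == 1 for env in DANGEROUS)

# Keep generating random individuals until one happens to meet the conditions.
attempts = 1
f = random_f()
while not survives(f):
    f = random_f()
    attempts += 1
print(f"A surviving association appeared after {attempts} random generations: {f}")
```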

However, we add another rule: the environment will change. Just like in this mini-game, what if the rule changes to death if you don’t touch the pillars?

Some of you may have played “Cat Mario”; you would use your previous Mario-playing strategies, leading to various deaths. So when the environment changes to A and D being dangerous, the individual will die.

Thus, even if an individual is lucky enough to be born with an association f that allows survival, it will die the next time the environment changes. If we rely on random generation alone, life will forever be stuck at the initial stage, existing only briefly before being wiped out, never advancing to the next stage.

Therefore, for life to continue, it needs a capability, the ability to adapt to change. This is also how Hawking describes intelligence: Intelligence is the ability to adapt to change.

Thus, nature uses large numbers of individuals that reproduce continuously, and this reproduction is not a perfect copy but produces mutations. Mutated individuals have a chance to adapt to the changed environment and survive, while individuals that cannot adapt are filtered out. The surviving individuals then continue to reproduce in large numbers, creating the diversity needed to face the next round of environmental filtering. The result of this filtering is evolution: evolution produces associations suited to the current environment. Through this dynamic cycle of reproduction, mutation, and selection, relatively stable life forms can be maintained.

The key to the game is that the speed of updating associations > the speed of environmental changes.

Thus, the behavior of many organisms in nature to reproduce in large numbers is not wasteful; it increases the number of samples available for selection, preventing all individuals from being filtered out after environmental changes.
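Below is a hedged sketch of this reproduce-mutate-select cycle (population size, mutation rate, and all names are illustrative assumptions):

```python
import random

ENVIRONMENTS = ["00", "01", "10", "11"]

def mutate(f, rate=0.2):
    """Reproduction is an imperfect copy: each inherited response may flip."""
    child = dict(f)
    for env in child:
        if random.random() < rate:
            child[env] = 1 - child[env]
    return child

def select(population, dangerous):
    """The environment filters out individuals that fail to avoid danger."""
    return [f for f in population if all(f[env] == 1 for env in dangerous)]

population = [{env: random.randint(0, 1) for env in ENVIRONMENTS}
              for _ in range(100)]
for generation in range(10):
    dangerous = set(random.sample(ENVIRONMENTS, 2))  # the rules keep changing
    survivors = select(population, dangerous)
    if not survivors:
        print(f"Generation {generation}: the whole population was filtered out.")
        break
    # Survivors reproduce in large numbers, with mutation restoring diversity.
    population = [mutate(random.choice(survivors)) for _ in range(100)]
    print(f"Generation {generation}: {len(survivors)} individuals survived.")
```

If the mutation rate (the speed of updating associations) is too low relative to how often the danger set changes, the population is easily wiped out, which is exactly the key condition stated above.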

Origin of Life

This is a video explanation about RNA as the origin of life; I won’t describe it. Interested friends can download my PPT to watch it themselves.

Core of Survival


This image roughly shows the relationship between several concepts. A large number of intelligent associations are stored by DNA, and the replication of DNA produces mutations, creating diversity. This diversity is reflected in individuals, and mutated individuals are selected by the environment, transforming the population and consequently filtering the intelligent associations.

Organizing the above content leads to the following conclusions:

1. The object of evolution is not the individual, nor even the genes, but the intelligent associations. The life and death of individuals only serve to update the population. Genes are like hard drives; the intelligent associations stored on them are the core. Evolution is the continuous filtering of associations, retaining those that fit the current environment.

2. The process of finding association f is also learning. The dynamic process of natural selection is a learning method at the population level.

3. Intelligence is the ability to adapt to change, and the core parts of intelligence include:

  • Learning, the search for associations, corresponds to natural selection in lower life forms.

  • Continuation, the storage of associations, allowing learned associations to persist. Other planets may have produced life, but there has not been a medium that can stably exist on that planet while also being able to continue and replicate itself. Therefore, in the movie “Prometheus”, DNA is referred to as the seed left by aliens on Earth.

  • Finally, decision-making, the application of associations, generating different behaviors based on learned associations to avoid danger.

The associations stored on DNA are like blueprints for tools, and various proteins are the tools made based on those blueprints. The blueprints are drawn by natural selection.

You will find that this is different from the intelligence you know; it seems overly simplistic. Because the above description only stays at the level of proteins, which represent low-level intelligence. Life that survives solely by this method is a virus, with only a protective protein shell and DNA or RNA, capable of self-replication. However, the tasks that a single protein can accomplish are very limited. To enhance the ability to adapt to change, a large number of proteins are combined to work in parallel, reaching the level of cellular intelligence. Similarly, many cells form tissues, then organs, systems, individuals, and groups; the more you go up, the more complex the tasks that can be accomplished.

I want to use the following video to let everyone feel that the internet thinking praised in recent years has always existed within us. Only after technology has enhanced our connection speed has it become prominent. We ourselves are like factories, with a large number of proteins working in parallel to complete various physiological functions. In the internet age, we are essentially no different from the proteins within us. You are not a single life.

Nature cannot suddenly produce particularly complex functions; higher intelligence is generated by the iteration of lower intelligence. Although a large number of organisms can survive well using low-level intelligence, we still want to understand the principles of advanced intelligence to serve ourselves, as our environment is more complex.

The most representative of advanced intelligence is the ability to think. However, the principles of thinking are difficult to decode.

Using the same approach as before, we instead ask: why did consciousness evolve to let us think, and what problems does it solve?

Movement Issues


Because environmental changes are random, evolution has no direction but has a trend of increasing diversity. Do you remember the key to the survival game? It is that the speed of updating associations must be greater than the speed of environmental changes. Increasing diversity increases the amplitude of updates.

The diversity that self-replication alone can produce is still limited. Therefore, nature gradually developed sexual reproduction, allowing two different individuals to mate and increase the diversity of their offspring.

However, sexual reproduction brings a problem: if life cannot move, individuals can only reproduce with nearby individuals, which limits the diversity of intelligent associations in that area; the diversity capability of sexual reproduction is restricted.

Therefore, to shuffle the deck, large-scale movement becomes a necessity. Even immobile organisms like plants use fruits to help animals spread their seeds.

But large-scale movement also brings another problem: the environment will change with movement, and the associations learned by individuals in the previous environment may not apply in the next environment. For example, if you place tropical plants in the desert, the different environment will kill them.

Predictive Models


At this point, a new learning method is needed; obviously, using natural selection is not feasible.

The increase in movement, sensory, and other abilities lengthens an individual's growth cycle. Imagine a young animal that takes several months to gestate: if one wrong step sends it into a pit, the species would be extinct within a few rounds. Such organisms cannot afford the cost of trial-and-error learning through natural selection.

Thus, for large-scale mobile organisms, they need to add another capability based on existing intelligence: prediction.

Association f is no longer a simple stress response; it is no longer an association from environment to behavior. Instead, it is an association from past events to future events. Organisms will use consciousness to simulate their relationship with the environment in their brains, predicting what will happen next, or even in the next few steps, to make decisions. This is akin to playing chess.

Neurons

Neurons contain large numbers of proteins that control the flow of ions in and out, giving them the ability to control electricity. They represent different states with different firing frequencies, allowing the neural network in the brain to simulate the states and changes of the environment.

This allows life to learn the associations between any two spaces at the individual level, no longer relying on natural selection as a learning method at the population level.

Visual Perception

Decision-making requires a basis, so life must have perception abilities to sense information around it. Taking image recognition, which deep learning excels at, as an example, let’s see what it is actually doing and further understand the concept of “establishing associations between two spaces”.


Here are four examples of how the vision of other organisms differs from that of humans. They give us an insight: the human eye does not see the world as it is; rather, it perceives the world in a way that suits human survival.

Because snakes' prey is usually active at night, snakes evolved a perception system that works well in darkness: thermal sensing.

Any visual system associates reflected light with concepts seen in the "brain". The same reflected light can yield different perceptions through different visual systems.

Therefore, image recognition is not about identifying what this object is; rather, it learns to find the human visual association method and applies it again. If we were not humans but snakes, the f that image recognition seeks would be different from what it is now.

When x represents an image, and y represents the concept of the image in the brain, the neural network completes image recognition. When x represents spoken words, and y represents what will be spoken, the neural network completes a language model. When x represents English, and y represents Chinese, the neural network completes machine translation.

The neural network seeks to find the association that explains the relationship between these two spaces from many examples of input to output. Just like the linear equation y=ax+b, giving you two examples to determine a and b. Once determined, you can use the established association in future activities to obtain the desired y from specific x. However, the associations in nature are not as simple as linear equations.
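As a toy illustration of "two examples determine a and b" (the numbers here are made up):

```python
import numpy as np

# Two examples (x, y) fully determine a linear association y = a*x + b.
# Say the examples are (1, 3) and (2, 5): solve a*1 + b = 3 and a*2 + b = 5.
A = np.array([[1.0, 1.0],
              [2.0, 1.0]])   # each row is [x, 1]
y = np.array([3.0, 5.0])
a, b = np.linalg.solve(A, y)
print(a, b)                  # -> 2.0 1.0, i.e. y = 2x + 1

# The established association now predicts y for unseen x:
x_new = 10.0
print(a * x_new + b)         # -> 21.0
```

A neural network does the same thing in spirit, but the association it fits is high-dimensional and nonlinear, and it needs far more than two examples.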

The Role of Consciousness

And the consciousness that represents advanced intelligence allows us to simulate in our brains what will happen next, thereby deciding the next action.

A person’s life is a continuous process of establishing various predictive models in their environment, forming a world model. You can call it a worldview.

What happens when a teacup falls, or what happens when there is a gunshot? When a person is at a height, they predict the consequences of falling, hence the fear.

As the environment changes, the established predictive models need to be updated. The environment we live in now differs from that of the ancients: we must predict the outcomes of investment projects and how stock prices, housing prices, and exchange rates will change in the coming months.

Thus, you can see that as living beings, we are always adapting to this ever-changing world.

The content of intelligence is association, with the core being learning. However, it is not limited to this; it also includes other abilities related to associations and learning.

Intelligence includes not only the establishment of associations but also when to collect data, as we learn from historical experiences.

Also, when to update established models, and when to use which model. All of these are part of intelligence.

For example, the prediction of the human brain is actually bidirectional; it constantly predicts what will happen next and compares the actual events with its predictions. Usually, this process does not attract your conscious attention. Only those events that conflict with your predictions will be noticed. For example, you do not pay attention to the stairs you walk every day, but when a step suddenly rises by 3 centimeters, you easily notice it. Events that conflict with your established model are more easily remembered and collected as training data for future learning, allowing you to better predict and survive. Therefore, the purpose of thinking is to predict.

And a person must keep learning throughout life, because the world is constantly changing. It is not true that children inherently learn better than adults. The reason for this impression is that people rely on their world model. Children have not yet constructed a complete world model, so the nature evolved into their DNA makes them curious about everything when young, wanting to learn everything. Adults have already established a relatively complete world model, which needs a protective mechanism to prevent them from being fooled. If you remained like a child, learning everything and updating everything, you could easily be brainwashed.

However, adults also update their established models. For example, to persuade an adult, it is better to describe an event; they will unconsciously predict its outcome. When you show them their prediction is wrong, they receive the signal that "the existing model is unreliable" and switch off the protective mechanism, allowing learning to update the existing model.

Intelligence is always executed in parallel; only consciousness cannot appear in two places at the same time. One reason is that consciousness must decide at a certain moment which association to apply. Our limbs have various associations, some for cycling, some for running, and consciousness plays a regulatory role.

Our current artificial intelligence has not yet reached the level of consciousness; it merely establishes associations between two spaces, so image recognition and speech recognition can be performed well. However, we will gradually move towards consciousness based on this foundation.

Prerequisite Knowledge

Artificial Intelligence


The three core parts of intelligence: learning, storage, and application, have their own realizations in nature.

Artificial intelligence aims to implement this ability in machines. For example, we do not rely on proteins but on machines to apply associations; we do not rely on DNA but on computers to store learned associations; we do not rely on natural selection but on machine learning algorithms to establish associations. Everyone’s goal is to make decisions for better survival.

So what knowledge is needed to achieve this goal?

The world is constantly changing, moving from one state to another. This involves two concepts: state and change.

So how do we accurately describe state and change?

We, who evolved to perceive three-dimensional space, are accustomed to describing objects in three-dimensional space. However, in addition to length, width, and height, there are many factors that determine the state of things. For example, factors that determine stock prices and weather are not just three.

Our world may not even be three-dimensional; it is merely because our three-dimensional spatial perception is sufficient for us to survive well. We have not evolved higher-dimensional perception abilities.

But how to reasonably describe these high-dimensional states and changes?

Linear algebra is the discipline used to describe state and changes in any dimensional space, and matrices in linear algebra serve as the medium for storing state and change information.

Through linear algebra, we learn how to describe the state of things and how they change. Unfortunately, for a tiny organism much information is missing, and we can never be 100% certain what state things will reach after a change; the underlying world may even be built on pure randomness. Therefore, we need probability to help us predict future states and make reasonable decisions under such circumstances.

At the same time, since we want to implement intelligence on computers, we need to understand how to control the matrices that store states and changes on computers.
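A minimal NumPy sketch of this idea (the vectors and matrix here are arbitrary illustrations):

```python
import numpy as np

# State: a point in n-dimensional space, stored as a vector.
# Change: a transformation of that space, stored as a matrix.
state = np.array([1.0, 0.5, -2.0])    # three factors describing a system

change = np.array([[0.9, 0.1, 0.0],   # how each new factor depends
                   [0.0, 1.0, 0.2],   # on each old factor
                   [0.1, 0.0, 0.8]])

next_state = change @ state           # applying the change to the state
print(next_state)
```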

Why Deep Learning is Efficient

Learning Difficulties

Now that we know what learning is, let’s look at where the difficulties in learning lie.

This part of understanding determines your design philosophy of neural networks. Learning requires training an association f from historical experience to solve new problems. Take the college entrance examination as an example; we train our problem-solving method f by doing past exam papers. The training method involves continuously doing questions and comparing with correct answers. When the college entrance examination comes, when we see a new question x, we hope to use our trained association f to obtain the correct answer y. What we really care about is not the past exam papers we have done, but the college entrance examination.

The difficulty in learning lies in performing well on unseen tasks.

Extreme Cases

Let’s consider an extreme case. Suppose there are only 4 true-false questions in the college entrance examination. Then you only need to remember the answers to those 4 questions to score full marks. However, the actual situation is that the possible questions for the college entrance examination are infinite, while the past exam papers we can train on are limited. For example, to recognize images of cats, cats can have various shapes, expressions, colors, and sizes, with all kinds of variations. We cannot exhaustively list all the cats; how can we train an association f that can effectively judge whether an image is a cat from a limited sample of images?

Learning requires finding a reasonable association f from limited examples. One direction is to train more data, to see more situations. For instance, some students use a strategy of doing a lot of questions. This is the role that big data has played in artificial intelligence in recent years.

However, relying solely on big data is insufficient. Another direction is exemplified by those top students who can grasp the core of the problem after doing only one or two questions. This is actually the key to deep learning surpassing other machine learning algorithms in natural tasks: incorporating prior knowledge and adjusting the hypothesis space.

Of course, the more data there is, the better for learning, but to understand why relying solely on big data is not enough, we need to grasp three issues.

First: The Curse of Dimensionality.


The first direction of learning mentioned above is to see more examples. However, as the dimensionality increases, the scenarios increase, making it impossible to see all situations.

Consider only simple discrete tasks where each dimension can take 10 values: with 1 dimension there are 10 situations, with 2 dimensions 100, and with 3 dimensions 1000. As the dimensionality increases, we become ever less able to see all situations. A common task nowadays can have hundreds of dimensions, and the data is continuous.
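The exponential blow-up is easy to verify (a trivial sketch):

```python
# With 10 distinguishable values per dimension, the number of possible
# situations grows exponentially with the dimension d.
for d in (1, 2, 3, 10, 100):
    print(f"{d:>3} dimensions -> {10**d:.0e} situations")
# At 100 dimensions there are 1e+100 situations -- no dataset can cover them.
```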

How to predict data that has not been seen? Traditional methods rely on the assumption that data is smooth, meaning that a value does not differ much from its neighbors. Thus, when encountering unseen data, we take the average of two encountered data values on either side. However, in high-dimensional cases, this method is very unreliable because it does not meet this assumption; the data is not smooth but has sharp fluctuations.

To make reliable predictions, more data is needed, and it must be different sample data. In reality, even with big data, this is difficult to achieve. Friends who have played “Hearthstone” can think about how much money they would need to collect all the cards purely by buying card packs.

Second: Finding Association f.


We rely on historical data to train association f, but the association f that explains historical data is not unique. For example, if I want two numbers to add up to 1, I can make one number 1 and the other 0; I can also make one number -299 and the other 300. Both can accomplish the task. This leads to the situation where the association f we seek may perfectly explain the training data but fails to guarantee perfect predictions in new forecasts.

For instance, in the two images on the slide, the association f learned on the left can perfectly predict the training set. However, when applied to the test set, the red part predicts incorrectly. What we actually want is a very regular spiral shape.

Similarly, in the college entrance examination, there are many ways to solve problems. Some are very clever, but these clever methods may only apply to specific questions. Other questions may not be suitable anymore. A student may find a method that can solve all the questions they have done but cannot guarantee that this method will also be effective in the college entrance examination.
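A classic way to see this non-uniqueness (my own illustration, not from the talk) is to fit the same training points with two polynomials. Both explain the training data, the more flexible one perfectly, yet it predicts worse on unseen points:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 8)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.1, 8)

simple = np.polynomial.Polynomial.fit(x_train, y_train, deg=3)
flexible = np.polynomial.Polynomial.fit(x_train, y_train, deg=7)  # fits exactly

x_test = np.linspace(0, 1, 100)
y_true = np.sin(2 * np.pi * x_test)
for name, model in [("degree 3", simple), ("degree 7", flexible)]:
    err = np.mean((model(x_test) - y_true) ** 2)
    print(f"{name}: mean squared test error {err:.3f}")
```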

Third: No Free Lunch.


This also leads us to the no free lunch theorem. Since association f is infinite, searching for association f is like searching for a needle in a haystack. Different machine learning algorithms are merely better at fishing in certain waters. If you focus on fishing in a particular area, then other areas will be neglected.

Therefore, if the association f you want to fish for can exist anywhere in the sea, then deep learning will not be superior to other machine learning algorithms. In this case, any two machine learning algorithms are equivalent.

Does this mean that deep learning is not excellent? Not necessarily. Because in many tasks in nature, the association f is not like I previously mentioned, appearing anywhere in the sea; rather, it concentrates in specific areas, those that conform to natural physical phenomena. Deep learning is adept at fishing in those specific areas.

Deep learning is what I previously mentioned as the second direction of learning, incorporating prior knowledge and adjusting the hypothesis space.

How should we understand the incorporation of prior knowledge? Suppose you ask a friend to guess what you are thinking of. The task is very hard, because you could be thinking of anything. Your friend will typically ask you to give a range, say food, or narrower still, fruit. They then no longer need to guess among all possible things, which makes guessing much easier.

During World War II, the German Enigma machine could generate thousands of cipher settings to transmit operational information. The British military hired mathematicians, including the father of artificial intelligence, Turing, to decipher its encryption principle directly. They did not assume that the data was smooth. Machine learning should use the same idea: think directly about how the data is generated. Many of the tasks we face are produced by nature's own code generators. Turing and his colleagues exploited the characteristic that Enigma never encrypts a letter to itself to crack the code.

So what characteristics does nature’s data have?

Distributed Representation


This introduces the first piece of prior knowledge about nature: parallel combination. This is the idea behind distributed representation in deep learning. Suppose we have 8 different apples. Using ordinary learning methods, we would need to see all 8 situations to learn perfectly. But if you tell me these different apples are formed by combinations of color, size, and shape, and each factor has two possible values, then we only need to learn these 2 + 2 + 2 = 6 factor values; there is no need to see all 8 variants. By incorporating the prior knowledge that variants are formed from combinations of factors, we reduce the amount of data needed for learning. The factors themselves also have variants (what kind of shape is an ellipse, exactly?), and we can continue this decomposition to reduce the required training data even further.
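A tiny sketch of the counting argument (the factor names are illustrative):

```python
from itertools import product

colors = ["red", "green"]
sizes = ["big", "small"]
shapes = ["round", "oval"]

# 2 + 2 + 2 = 6 factor values to learn...
n_factor_values = len(colors) + len(sizes) + len(shapes)
# ...yet their combinations generate 2 * 2 * 2 = 8 distinct apples.
apples = list(product(colors, sizes, shapes))
print(n_factor_values, len(apples))   # -> 6 8

# With n binary factors the gap explodes: 2*n factor values vs 2**n combinations.
```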

Neural Networks


Taking face recognition as an example, various faces are composed of features, and various features are formed by different shapes, which are in turn formed by pixels. We can continue to break this down to reduce the number of sample data needed for training.

However, this method of decomposition has a premise. We know that the things in this world are formed through combination: one carbon atom plus two oxygen atoms form carbon dioxide, while three oxygen atoms form ozone. At the same time, the combination is not linear. Now look at the basic transformation of a neural network, y = a(Wx + b): each layer linearly combines the factors within x and then applies a nonlinear transformation a to yield y, mimicking the way data is generated in nature. Training a neural network means providing a large number of x's with their corresponding y's and learning W and b, just as the linear equation y = ax + b can be solved for a and b from two examples. In the first part we also saw that the human body is built by this same combinatorial method, which is why neural networks are so well suited to image recognition and speech recognition.
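Here is a minimal NumPy sketch of that per-layer transformation (the shapes and the choice of ReLU are illustrative assumptions):

```python
import numpy as np

def layer(x, W, b):
    """One layer: linearly combine the factors in x, then apply a
    nonlinearity -- the basic transformation y = a(Wx + b)."""
    return np.maximum(0.0, W @ x + b)   # ReLU as the nonlinearity a

# Illustrative shapes: 4 input factors combined into 3 new factors.
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = rng.normal(size=(3, 4))   # in practice, learned from (x, y) examples
b = rng.normal(size=3)
print(layer(x, W, b))
```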

If in a completely different physical law world, the things and our bodies may not be formed in this combinatorial manner, then the associations learned using this decomposition method will not be able to effectively complete the data generation task, and the predictions will not be reliable.

However, up to now, what has been described is merely why shallow neural networks are excellent. A neural network with a single hidden layer can fit any training data, as long as it has enough hidden nodes.

But why is deep learning superior to shallow learning? The reason can already be felt in the ellipse example above: decompose the factors once more, further reducing the amount of training data needed.

However, this actually introduces a second piece of prior knowledge: iterative transformation.

We know that atoms form molecules, and things are further iterated based on the molecules formed, not regenerated from atoms. Airplanes are formed from atoms to molecules and then to various components. Tanks also utilize the same molecular layer. Although as images, tanks and airplanes are different samples, they share the same molecular layer. This means that when you train on airplane samples, you indirectly train on tanks, reducing the amount of training data needed.


In the two images on the right of the slide, the left one is the neural network connection diagram, and the right one is the relationship diagram between different variants. Linked circles represent the different values a node can take, while separate circles represent different nodes.

If a neural network with a single hidden layer is used for learning, then each variant is only decomposed into independent factors serving its own purpose, and will not have an effect on other samples.

However, if a deep network is used, as shown in the lower image, it is easy to form sharing among the factors a, b, and c. Thus, when training the sample (3,0), it indirectly trains all other samples that share a, b, and c.

Moreover, this is how modular programming works: we do not write a program in one go but break it into many small modules. These small modules are applicable under different needs; in other words, they are shared. We do not need to rewrite everything each time a slight change occurs.
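A hedged sketch of this kind of sharing, assuming PyTorch (the layer sizes and the airplane/tank task split are my own illustration):

```python
import torch
import torch.nn as nn

# A shared "molecular" trunk feeds two task-specific heads. Training on
# airplane samples updates the shared trunk, indirectly training the
# factors that the tank head relies on as well.
shared_trunk = nn.Sequential(          # atoms -> molecules
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
)
airplane_head = nn.Linear(128, 1)      # molecules -> airplane score
tank_head = nn.Linear(128, 1)          # molecules -> tank score

x = torch.randn(32, 784)               # a batch of flattened images
features = shared_trunk(x)             # factors shared by both classes
print(airplane_head(features).shape, tank_head(features).shape)
```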

The network described above is called a deep feedforward neural network. Transformations occur layer by layer, without skipping between different layers, and combinations occur only between factors within the same layer.

Imagine a network in which any node can connect to any other node: such a network has no focus. It would be like being told that the person you are looking for could be anywhere. This is equivalent to incorporating no prior knowledge at all and doing nothing to narrow the search for the association f.

Thus, the connection method of feedforward neural networks effectively reduces the amount of data needed for training because this combination method aligns well with the physical formation laws of nature.

Therefore, deep learning does not excel in all tasks, but in many naturally formed data sets, it surpasses other machine learning algorithms.

Deep Neural Networks

At this point, let’s revisit the relationship between life and the environment. The environment is becoming increasingly complex, but complexity increases based on existing factors, generating various variants according to physical laws. For example, if there are n factors, each factor can have two different states, then the number of variants that can be formed will be 2 raised to the power of n.


Life learning is about decomposing these variants into factors and learning an association from them, and this association is knowledge.

The deep feedforward neural network we are currently discussing incorporates two inherent pieces of prior knowledge from nature:

  • Parallel: New states are formed by the parallel combination of several old states.

  • Iterative: New states can be formed again from already formed states.

Feedforward neural networks can be applied to almost all tasks, but they are very general, and the prior knowledge they provide is not targeted.

If the prior knowledge is not sufficiently targeted, the amount of data required for training will increase, and after a certain depth, the rules of noise formation will also be learned into the model, which are not what we want.

Other variants of neural networks, such as recurrent neural networks and convolutional neural networks, provide much more targeted prior knowledge, which can reduce the area of the search for associations and exclude additional interference caused by noise rules.

The differences between neural network variants lie in the different prior knowledge you have incorporated.

I hope to use this video to help everyone feel the presence of these two pieces of prior knowledge, parallel combinations and iterative transformations, in nature.

Applications: Design Philosophy

After laying out so much, we finally arrive at the core part. Now that we know why deep learning is efficient, we have corresponding guidance on how to design networks.

Basic Philosophy


First, two points must be clarified:

  • Deep learning is not omnipotent; the premise of using deep learning is that your data can utilize such prior knowledge. Otherwise, it is like using clever methods for solving English questions to solve math problems.

  • Secondly, deep learning does not have a fixed form; do not think that a recurrent neural network is just a recurrent neural network, and a convolutional neural network is just a convolutional neural network. If you learn neural networks this way, you will never finish learning in your lifetime because the ways networks can connect are infinite. You need to grasp at least two cores: decomposing factors and sharing factors.

The nodes within each layer represent factors, and these factors collectively describe a certain state of things. These states can develop layer by layer, and the developed state can be decomposed and merged again for the next state transition.

You can think of a box as a neural network, and the neural network can continue to form deeper neural networks with other neural networks. For example, the output processed by the convolutional layer can be further processed by the recurrent layer.

In the right image, the second stage’s factors are provided by three neural networks, and the factors between different neural networks can be added together or merged into higher-dimensional states.
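A sketch of such composition, again assuming PyTorch (all dimensions are illustrative):

```python
import torch
import torch.nn as nn

# Three sub-networks each produce factors, which are then merged for the
# next stage: either added together (same dimension) or concatenated
# into a higher-dimensional state.
net_a = nn.Sequential(nn.Linear(10, 16), nn.ReLU())
net_b = nn.Sequential(nn.Linear(20, 16), nn.ReLU())
net_c = nn.Sequential(nn.Linear(30, 16), nn.ReLU())

xa, xb, xc = torch.randn(1, 10), torch.randn(1, 20), torch.randn(1, 30)
fa, fb, fc = net_a(xa), net_b(xb), net_c(xc)

summed = fa + fb + fc                    # added together: still 16-dim
merged = torch.cat([fa, fb, fc], dim=1)  # concatenated: 48-dim state
next_stage = nn.Linear(48, 8)            # a further network consumes it
print(summed.shape, merged.shape, next_stage(merged).shape)
```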

Designing neural networks is more like playing with LEGO blocks, but the rules of play involve how to decompose factors and how to enable different samples to share factors. So when you see many new network structures, be sure to consider how their structures consider factor decomposition and factor sharing.

Although everyone is accustomed to calling them recurrent neural networks and convolutional neural networks, please understand them as different ways of incorporating prior knowledge: different schemes for decomposing factors and sharing factors.
