Understanding Neural Networks, Manifolds, and Topology Through 18 Images

Source | OSCHINA Community

Author | OneFlow Deep Learning Framework

Original link: https://my.oschina.net/oneflow/blog/5559651

So far, one major concern about neural networks has been that they are hard-to-interpret black boxes. This article aims to explain, from a theoretical standpoint, why neural networks perform so well at pattern recognition and classification: in essence, they distort and transform the original input through successive affine and nonlinear transformations until the different categories can be easily distinguished. The backpropagation (BP) algorithm then continuously fine-tunes this distortion based on the training data. The article uses multiple animations to vividly explain how neural networks work; related discussion can also be found among Zhihu users: https://www.zhihu.com/question/65403482/answer/2490040491

Author | Christopher Olah

Source | Datawhale

Translation | Liu Yang

Proofreading | Hu Yanjun (OneFlow)

About ten years ago, deep neural networks achieved breakthrough results in fields such as computer vision, attracting great interest and attention.

However, some people still express concerns. One reason is that neural networks are black boxes: if a neural network is well-trained, it can yield high-quality results, but it is difficult to understand how it works. If a neural network fails, it is also hard to pinpoint the problem.

While it is difficult to understand deep neural networks as a whole, we can start with low-dimensional deep neural networks, which have only a few neurons per layer and are much easier to understand. We can use visualization methods to understand the behavior and training of low-dimensional deep neural networks. Visualization methods allow us to intuitively grasp the behavior of neural networks and observe the connection between neural networks and topology.

Next, I will discuss many interesting things, including the lower bound of the complexity of neural networks that can classify specific datasets.

1
A Simple Example
Let’s start with a very simple dataset. In the figure below, the two curves on the plane consist of countless points. The neural network will try to distinguish which points belong to which curve.
[Image]
To observe the behavior of the neural network (or any classification algorithm), the most direct method is to see how it classifies each data point.
We start by observing the simplest neural network, which has only an input layer and an output layer. Such a neural network simply separates the two classes of data points with a straight line.
[Image]
This neural network is too simple and crude. Modern neural networks typically have multiple layers between the input layer and the output layer, known as hidden layers. Even the simplest modern neural networks have at least one hidden layer.
[Image]
A simple neural network. Image source: Wikipedia
Similarly, we observe the operations performed by the neural network on each data point. It can be seen that this neural network separates the data points with a curve instead of a straight line. Clearly, curves are more complex than straight lines.
[Image]
Each layer of the neural network represents the data with a new representation. We can observe how the data is transformed into new representations and how the neural network classifies them. In the representation of the last layer, the neural network draws a line between the two classes of data to distinguish them (if in higher dimensions, it would draw a hyperplane).
In the previous visualization graphics, we saw the original representation of the data. You can think of it as how the data looks in the “input layer”. Now let’s look at how the data appears after transformation; you can think of it as how the data looks in the “hidden layer”.
Each dimension of the data corresponds to the activation of a neuron in the layer of the neural network.
[Image]
The hidden layer represents the data in such a way that it can be separated by a straight line (i.e., linearly separable).
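The idea that a hidden layer re-represents the data until a straight line suffices can be sketched with a tiny numpy example. The XOR-style points below cannot be separated by any straight line in the input space, but a tanh hidden layer with hand-picked weights (an illustrative assumption, not a trained network) maps them to a representation where a single line separates the classes:

```python
import numpy as np

# Four XOR-style points: no straight line separates the two classes here.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 1, 1, 0])

# Hand-picked hidden-layer weights (illustrative, not trained).
W = np.array([[ 2.0,  2.0],
              [-2.0, -2.0]])
b = np.array([-1.0, 3.0])
H = np.tanh(X @ W.T + b)          # hidden-layer representation of each point

# In the new representation, the line h1 + h2 = 1 separates the classes.
pred = (H.sum(axis=1) > 1.0).astype(int)
print(pred)                        # matches `labels`
```

The hidden representation, not the input, is what the final layer's straight line cuts through.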
2
Continuous Visualization of Layers
In the previous method, each layer of the neural network represents the data with different representations. As a result, the representations between each layer are discrete and not continuous.
This creates difficulties for our understanding; how does the transformation occur from one representation to another? Fortunately, the properties of layers in neural networks make this understanding very easy.
There are various different layers in neural networks. Below, we will discuss the tanh layer as a specific example. A tanh layer includes:
  • Linear transformation using the “weight” matrix W

  • Translation using vector b

  • Pointwise application of tanh

We can consider this as a continuous transformation, as shown below:
[Image]
The situation is roughly the same for other standard layers, consisting of affine transformations and pointwise applications of monotonic activation functions.
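As a minimal sketch, the three steps listed above compose into a single function (numpy used for illustration):

```python
import numpy as np

def tanh_layer(x, W, b):
    """One tanh layer applied to a point x."""
    y = W @ x          # 1. linear transformation by the weight matrix W
    y = y + b          # 2. translation by the vector b
    return np.tanh(y)  # 3. pointwise application of tanh

# With W = identity and b = 0, the layer reduces to pointwise tanh.
print(tanh_layer(np.array([1.0, 0.0]), np.eye(2), np.zeros(2)))
```

Swapping `np.tanh` for another monotonic activation gives the corresponding standard layer.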
We can use this method to understand more complex neural networks. For example, the neural network below uses four hidden layers to classify two spirals that are slightly entangled. It can be seen that in order to classify the data, the representation of the data is continuously transformed. The two spirals were initially entangled, but by the end, they can be separated by a straight line (linearly separable).
[Image]
On the other hand, the neural network below, although it also uses multiple hidden layers, cannot separate two spirals that are more deeply entangled.
[Image]
It should be noted that these two spiral classification tasks are challenging partly because we are using only low-dimensional neural networks. With a wider neural network, everything would be much easier.
(Andrej Karpathy created a great demo based on ConvnetJS that allows people to explore neural networks interactively through this visualization.)
3
The Topology of the Tanh Layer
Each layer of the neural network stretches and compresses space, but it does not shear, cut, or fold space. Intuitively, neural networks do not destroy the topological properties of the data. For example, if a set of data is continuous, then it remains continuous after being transformed into a new representation (and vice versa).
Such transformations that preserve topological properties are called homeomorphisms. Formally, a homeomorphism is a continuous bijection whose inverse is also continuous.
Theorem: If the weight matrix W is non-singular, and a layer of the neural network has N inputs and N outputs, then the mapping of this layer is a homeomorphism (for specific domains and ranges).
Proof: Let’s go step by step:
1. Assume W has a non-zero determinant. Then it is a bijective linear function with a linear inverse. Linear functions are continuous. Therefore, a transformation like “multiplying by W” is a homeomorphism;
2. The “translation” transformation is a homeomorphism;
3. Tanh (as well as sigmoid and softplus, but not ReLU) is a continuous function with a continuous inverse (for appropriate domains and ranges). It is a bijection, and its pointwise application is a homeomorphism.
Thus, if W has a non-zero determinant, this layer of the neural network is a homeomorphism.
If we combine such layers arbitrarily, this result still holds.
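The theorem can be checked numerically: because each step is invertible, a point can be recovered from the layer's output by undoing the steps in reverse order. The sketch below assumes the random W is non-singular, which holds almost surely:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 2))   # a generic (almost surely non-singular) W
b = rng.normal(size=2)
x = np.array([0.3, -0.8])

y = np.tanh(W @ x + b)                               # forward through the layer
x_recovered = np.linalg.solve(W, np.arctanh(y) - b)  # undo tanh, then b, then W
assert np.allclose(x, x_recovered)
```

No information about the input is destroyed by such a layer, which is exactly what "homeomorphism" promises.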
4
Topology and Classification
Let’s look at a two-dimensional dataset that contains two classes of data A and B:
[Image]
A is red, B is blue
Note: To classify this dataset, the neural network (regardless of depth) must have a layer with 3 or more hidden units.
As mentioned earlier, using sigmoid units or softmax layers for classification is equivalent to finding a hyperplane (in this case, a straight line) in the representation of the last layer that separates A and B. If there are only two hidden units, the neural network cannot separate the data topologically in this way and will fail to classify the above dataset.
In the visualization below, the transformations of the hidden layer change the representation of the data, and the line is the dividing line. It can be seen that the dividing line continuously rotates and moves but is always unable to effectively separate the two classes of data A and B.
[Image]
This neural network, no matter how trained, cannot perform the classification task well.
In the end, it can only barely achieve a local minimum, reaching an 80% classification accuracy.
The above example has only one hidden layer, and due to having only two hidden units, it will fail in classification regardless of the circumstances.
Proof: If there are only two hidden units, either the transformation of this layer is a homeomorphism, or the layer’s weight matrix has determinant 0. If it is a homeomorphism, A is still surrounded by B and cannot be separated by a straight line. If the determinant is 0, the dataset is collapsed along some axis; since A is surrounded by B, collapsing along any axis mixes some points of A with points of B, making them impossible to distinguish.
However, if we add a third hidden unit, the problem is solved. At this point, the neural network can transform the data into the following representation:
[Image]
At this point, it can be separated by a hyperplane.
To better explain the principle, here is a simpler one-dimensional dataset:
[Image]
To classify this dataset, a layer consisting of two or more hidden units must be used. If two hidden units are used, a nice curve can represent the data, which can then be separated by a straight line:
[Image]
How is this achieved? One hidden unit learns to fire when x > -1/2, and the other learns to fire when x > 1/2. When the first unit fires but the second does not, we know the point belongs to class A.
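This construction can be sketched with hand-set weights (illustrative values; a steep tanh approximates the two thresholds):

```python
import numpy as np

# 1-D dataset: class A occupies the middle interval, class B the two sides.
x = np.array([-1.0, -0.6, -0.25, 0.0, 0.25, 0.6, 1.0])
labels = np.array([0, 0, 1, 1, 1, 0, 0])   # 1 = A (middle), 0 = B (sides)

h1 = np.tanh(10 * (x + 0.5))   # first hidden unit: fires when x > -1/2
h2 = np.tanh(10 * (x - 0.5))   # second hidden unit: fires when x > 1/2
pred = ((h1 - h2) > 1.0).astype(int)   # A = first fires, second does not
print(pred)                    # matches `labels`
```

In the (h1, h2) representation the data lies on a curve that a straight line can cut.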
5
The Manifold Hypothesis
Does the manifold hypothesis make sense for handling real-world datasets (like image data)? I think it does.
The manifold hypothesis states that natural data forms low-dimensional manifolds in their embedded spaces. This hypothesis is supported by both theoretical and experimental evidence. If you believe the manifold hypothesis, then the task of classification algorithms can be reduced to separating a set of entangled manifolds.
In the previous example, one class completely surrounded the other. In real-world data, however, the manifold of dog images is unlikely to be completely surrounded by the manifold of cat images. Still, other, more plausible topological situations can cause problems, as the next section will show.
6
Links and Homotopy
Next, I will discuss another interesting dataset: two linked tori, A and B.
[Image]
As with the datasets discussed earlier, an n-dimensional dataset cannot be separated without using at least n+1 dimensions; in this case, that means a 4th dimension.
The linking problem belongs to knot theory in topology. Sometimes, we see a link and cannot immediately determine whether it is an unlink (unlink means that although they are entangled, they can be separated by continuous deformation).
[Image]
A simple unlink
If a neural network with only three hidden units can classify a dataset, then that dataset is an unlink (the question arises: theoretically, can all unlinks be classified by a neural network with only three hidden units?).
From the perspective of knot theory, the continuous visualization of the data representations generated by the neural network is not only a good animation but also a process of untying links. In topology, we refer to this as the ambient isotopy between the original link and the separated link.
The ambient isotopy between manifold A and manifold B is a continuous function:

F: [0,1] × X → X

such that each F_t is a homeomorphism from X to itself, F_0 is the identity function, and F_1 maps A to B. In other words, F_t continuously transitions from mapping A to itself to mapping A to B.
Theorem: If the following three conditions are satisfied simultaneously: (1) W is non-singular; (2) the order of the neurons in the hidden layer can be manually arranged; (3) the number of hidden units is greater than 1, then there is an ambient isotopy between the input and the representations produced by the neural network layer.
Proof: Let’s go step by step:
1. The hardest part is the linear transformation. To make it work, we need W to have a positive determinant. Our premise is that the determinant is non-zero; if it is negative, we can flip its sign by swapping two hidden neurons, so we may assume it is positive. The space of matrices with positive determinant is path-connected, so there exists a path p: [0,1] → GL_n(R) with p(0) = Id and p(1) = W. We can continuously transition from the identity function to the W transformation via the function x → p(t)x, multiplying x at each time t by the continuously transitioning matrix p(t).
2. We can continuously transition from the identity function to the b translation via the function x → x + tb.
3. We can continuously transition from the identity function to the pointwise application of σ (e.g., tanh) via the function x → (1 − t)x + tσ(x).
I guess some might be interested in the following question: Can a program be developed to automatically discover such ambient isotopies and also automatically prove the equivalence of certain different links or the separability of certain links? I would also like to know if neural networks can outperform current SOTA techniques in this regard.
Although the types of links we are discussing may not appear in real-world data, real data may have higher-dimensional generalizations.
Links and knots are 1-dimensional manifolds, yet 4 dimensions are needed to untangle all of them. Similarly, untangling n-dimensional manifolds can require still higher-dimensional space. All n-dimensional manifolds can be untangled in 2n+2 dimensions.
7
A Simple Method
The simplest method for neural networks is to directly pull apart the entangled manifolds and to stretch the parts that are intertwined as much as possible. Although this is not the fundamental solution we seek, it can achieve relatively high classification accuracy, reaching a relatively ideal local minimum.
[Image]
This method can lead to very high derivatives in the regions being stretched. Countering this calls for a contractive penalty, which penalizes the derivatives of the layers at data points.
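A contractive penalty for a tanh layer can be sketched directly, using the fact that d tanh(u)/du = 1 − tanh(u)² (a minimal numpy illustration, not a full training setup):

```python
import numpy as np

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of the Jacobian of tanh(Wx + b) at x:
    large wherever the layer stretches space sharply around the point."""
    u = W @ x + b
    J = (1.0 - np.tanh(u) ** 2)[:, None] * W   # diag(tanh'(u)) @ W
    return np.sum(J ** 2)

# At x = 0 with W = identity and b = 0, the Jacobian is the identity,
# so the penalty equals the dimension (here 2).
print(contractive_penalty(np.zeros(2), np.eye(2), np.zeros(2)))
```

Adding this term to the training loss discourages the extreme stretching described above.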
Local minima are of no use in solving topological problems, but topological problems may provide good ideas for exploring solutions to the aforementioned issues.
On the other hand, if we only care about achieving good classification results, then if a small part of a manifold is entangled with another manifold, is that a problem for us? If we only care about classification results, then this does not seem to be an issue.
(My intuition suggests that such shortcut methods are not good and can easily lead to dead ends. Especially in optimization problems, seeking local minima does not truly solve the problem, and if a solution that does not genuinely resolve the issue is chosen, good performance will ultimately not be achieved.)
8
Selecting Neural Network Layers More Suitable for Manipulating Manifolds?
I believe standard neural network layers are not suitable for manipulating manifolds because they use affine transformations and pointwise activation functions.
Perhaps we could use a completely different type of neural network layer?
One idea that comes to mind is to first let the neural network learn a vector field, where the direction of the vector field is the direction in which we want to move the manifold:
[Image]
Then deform the space based on this:
[Image]
We can learn the vector field at fixed points (just take some fixed points from the training set as anchors) and interpolate in some manner. The vector field above is of the form:

F(x) = (v0·f0(x) + v1·f1(x)) / (1 + f0(x) + f1(x))

where v0 and v1 are vectors, and f0(x) and f1(x) are n-dimensional Gaussians. This idea is inspired by radial basis functions.
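A sketch of this idea follows (the anchors, displacement vectors, Gaussian width, and step size are all hypothetical choices for illustration):

```python
import numpy as np

def bump(x, center, width=1.0):
    """Unnormalized n-dimensional Gaussian centered at `center`."""
    return np.exp(-np.sum((x - center) ** 2) / (2.0 * width ** 2))

def vector_field(x, anchors, vectors):
    """Blend the anchor displacement vectors v_i by Gaussian bumps f_i(x)."""
    num = sum(bump(x, a) * v for a, v in zip(anchors, vectors))
    den = 1.0 + sum(bump(x, a) for a in anchors)
    return num / den

def deform(x, anchors, vectors, step=0.1, n_steps=10):
    """Move a point along the field in small steps, deforming space."""
    for _ in range(n_steps):
        x = x + step * vector_field(x, anchors, vectors)
    return x

# One anchor at the origin pushing points to the right.
p = deform(np.zeros(2), [np.zeros(2)], [np.array([1.0, 0.0])])
print(p)   # the point has moved in the +x direction
```

In a trainable layer, the anchors and vectors would be learned parameters rather than fixed constants.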
9
K-Nearest Neighbor Layer
My other viewpoint is that linear separability may be an overly high and unreasonable requirement for neural networks, and perhaps using k-nearest neighbors (k-NN) would be better. However, the k-NN algorithm largely relies on the representation of the data, so a good data representation is needed for k-NN to yield good results.
In the first experiment, I trained some MNIST neural networks (two-layer CNN, no dropout) with an error rate of less than 1%. Then, I discarded the final softmax layer and used the k-NN algorithm, and multiple results showed that the error rate dropped by 0.1-0.2%.
However, I feel that this approach is still not correct. The neural network is still trying to perform linear classification, but because the k-NN algorithm is used, it can slightly correct some of the errors it made, thus reducing the error rate.
Due to the weighting of (1/distance), k-NN is differentiable with respect to the data representation it acts upon. Therefore, we can directly train the neural network for k-NN classification. This can be viewed as a “nearest neighbor” layer, which functions similarly to a softmax layer.
We do not want to feed the entire training set through the network for each mini-batch, as that would be too costly. I think a good approach is to classify each element of a mini-batch based on the classes of the other elements in the mini-batch, giving each a weight of 1/(distance from the classification target).
Unfortunately, even using complex architectures, the k-NN algorithm can only reduce the error rate to 4-5%, while the error rate is even higher with simpler architectures. However, I did not put much effort into hyperparameters.
But I still like the k-NN algorithm because it is more suitable for neural networks. We want points on the same manifold to be closer to each other, rather than fixating on separating manifolds with hyperplanes. This is equivalent to shrinking individual manifolds while enlarging the space between different category manifolds. This simplifies the problem.
10
Conclusion
Certain topological properties of data may prevent these data from being linearly separable using low-dimensional neural networks (regardless of the depth of the neural network). Even in technically feasible situations, such as spirals, it is very difficult to achieve separation with low-dimensional neural networks.
To accurately classify data, neural networks sometimes require wider layers. Additionally, traditional neural network layers are not suitable for manipulating manifolds; even when manually setting weights, it is challenging to achieve ideal data transformation representations. New neural network layers may serve as good auxiliary tools, especially new layers inspired by understanding machine learning from the perspective of manifolds.
(Original translation: https://mp.weixin.qq.com/s/Ph2DADMGzi-HC4lIuU5Byw;
Original: http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/)
