Recently, I have been looking at several dimensionality reduction algorithms. This article first gives a comparison table summarizing seven algorithms, including the (hyper)parameters that can be tuned for each algorithm, the main objective of each algorithm, and so on. It then introduces the basic concepts of dimensionality reduction: what it is, why it is needed, and how it addresses the curse of dimensionality. Next it analyzes the perspectives from which dimensionality reduction can be approached, and finally walks through the concrete procedure of each algorithm. The table of contents is as follows:
- 1. Basic Concepts of Dimensionality Reduction
- 2. Perspectives for Dimensionality Reduction
- 3. Dimensionality Reduction Algorithms
  - 3.1 Principal Component Analysis (PCA)
  - 3.2 Multidimensional Scaling (MDS)
  - 3.3 Linear Discriminant Analysis (LDA)
  - 3.4 Isometric Mapping (Isomap)
  - 3.5 Locally Linear Embedding (LLE)
  - 3.6 t-SNE
  - 3.7 Deep Autoencoder Networks
- 4. Summary
- 5. Code Appendix
As usual, let's first present a comparison table of the dimensionality reduction algorithms. In the table, X denotes the high-dimensional input matrix of size D×N (dimensionality D times the number of samples N), Z denotes the low-dimensional output matrix of size d×N, and the linear mapping is $Z = W^T X$. A is the matrix of pairwise distances between points in high-dimensional space; $S_w$ and $S_b$ are the within-class and between-class scatter matrices of LDA; k is the number of neighbors used to define local relationships in manifold learning; F is the matrix of linear-combination weights that reconstruct a high-dimensional point from its surrounding points; P is the probability matrix derived from pairwise distances in high-dimensional space, and Q is the corresponding probability matrix in low-dimensional space. For the autoencoder, the table also lists the number of fully connected layers and the number of nodes in each layer. A dash (−) indicates that the entry is uncertain or not applicable.
Figure 1 Comparison of Different Dimensionality Reduction Algorithms
Whether the autoencoder should be marked as centered is still somewhat questionable. When processing image data, the input image is usually preprocessed to have zero mean, but that operation subtracts the mean of a single sample [1], whereas centering here means subtracting the mean of each dimension across samples, which is not the same thing. Now, let's discuss dimensionality reduction in detail.
1. Basic Concepts of Dimensionality Reduction
Dimensionality reduction means representing a vector of dimension D with a vector of dimension d that still contains the useful information, where d < D. Suppose we have a 512*512 image and want to classify it with an SVM: the most straightforward approach is to unfold the image, row by row or column by column, into a vector of length 512*512 and multiply it with the SVM parameters. If we can reduce that 512*512-length vector to, say, 100 dimensions while retaining the useful information, then the space needed to store the input and the parameters shrinks dramatically, and so does the time spent on the vector multiplication. Dimensionality reduction therefore directly cuts computation time. Moreover, data tends to be sparsely distributed in high-dimensional space: 100 samples in a 100-dimensional space are inevitably very sparse, and the number of samples required grows exponentially with each added dimension. This sparsity problem in high-dimensional space is called the curse of dimensionality, and dimensionality reduction can alleviate it.
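As a small illustration of the storage and computation argument (a sketch with made-up numbers, not taken from the article), flattening a 512*512 image and comparing it with a hypothetical 100-dimensional representation looks like this:

```python
import numpy as np

image = np.random.rand(512, 512)      # a toy 512*512 "image"
x = image.reshape(-1)                 # flattened row by row: length 262144
w = np.random.rand(x.size)            # SVM-style weight vector of the same length
score = w @ x                         # 262144 multiply-adds per sample

# if the same sample could be represented by only 100 numbers instead:
z = np.random.rand(100)               # stand-in for a reduced representation
w_small = np.random.rand(100)
score_small = w_small @ z             # only 100 multiply-adds per sample
print(x.size, z.size)                 # 262144 vs 100 values stored per sample
```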
The reason dimensionality reduction is possible at all is that data contains redundancy, either useless information or repeated information. For example, in a 512*512 image only the central 100*100 region may have non-zero values, and the rest is useless; or, if an image is centrally symmetric, the symmetric half of the information is redundant. Properly reduced data generally retains most of the important information of the original data and can stand in for the input in other tasks, greatly reducing the computational load; for instance, data can be reduced to two or three dimensions for visualization.
2. Perspectives for Dimensionality Reduction
Generally, there are two perspectives from which to approach dimensionality reduction. The first is feature selection: directly keep a subset of the original features, for example taking only the central region of a 512*512 image. The second, which is the focus here, is to transform the original high-dimensional space into a new space by a linear or non-linear mapping. The latter is usually realized in one of two ways [2]. One is projection-based: map from the high-dimensional space to the low-dimensional space, with PCA as the representative algorithm and LDA and the autoencoder also in this family; the main goal is to learn or compute a transformation matrix W that is multiplied with the high-dimensional data to obtain the low-dimensional data. The other is manifold learning, which seeks a low-dimensional description of samples lying in high-dimensional space. It assumes that the data is arranged on a regular low-dimensional manifold embedded in the high-dimensional space, but this regular arrangement cannot be measured directly by the Euclidean distance of the high-dimensional space. As shown in the left figure below, the true distance between two points should be the distance measured on the unfolded manifold in the right figure. If there is a way to describe the manifold in the high-dimensional space, then this spatial relationship can be preserved during dimensionality reduction. To achieve this, manifold learning assumes that local regions of the high-dimensional space still behave like Euclidean space, so distances there can be computed with the Euclidean metric (Isomap), or a point's coordinates can be written as a linear combination of its neighbors (LLE). In this way a relationship defined in the high-dimensional space can be carried over to the low-dimensional space, which lets manifold learning be used for data compression, visualization, obtaining effective distance matrices, and so on.
Figure 2 Manifold Learning
3. Processes of Several Dimensionality Reduction Methods
3.1 Principal Component Analysis (PCA)
PCA was invented by Karl Pearson in 1901 and is a linear dimensionality reduction method. A point $x_i$ in the high-dimensional space (dimension D) is mapped to a point in the low-dimensional space (dimension d, d < D) by multiplication with a matrix W: $z_i = W^T x_i$, where W has size D×d and i indexes the i-th sample point. In this way the N points are mapped from the D-dimensional space to the d-dimensional space. The goal of PCA is to keep the mapped points as spread out as possible, that is, to maximize the variance of the N projected points. If every dimension of the D-dimensional data has mean 0, i.e. $\sum_i x_i = 0$, then multiplying both sides by $W^T$ shows that the reduced data also has mean 0. Now consider the matrix $C = \frac{1}{N}\sum_i x_i x_i^T = \frac{1}{N} X X^T$: this is the covariance matrix of the D-dimensional data, whose diagonal entries are the variances of the individual dimensions and whose off-diagonal entries are the covariances between pairs of dimensions.

For the covariance matrix of the reduced d-dimensional data, $B = \frac{1}{N} Z Z^T = W^T C W$, we want the projected points to be as spread out as possible, so the diagonal entries of B (the variances of the reduced dimensions) should be as large as possible; a large variance means that dimension discriminates the data well. At the same time we want the d dimensions to be orthogonal, i.e. independent, so that they carry no overlapping information. The data then represents itself best, with every dimension both discriminative and informative in a different way; in that case all off-diagonal entries of B are 0, so we can derive that W must make $B = W^T C W$ diagonal.

This equation shows that the role of the linear transformation W in PCA is to diagonalize the original covariance matrix C. Since diagonalization in linear algebra is obtained by solving for eigenvalues and the corresponding eigenvectors, we arrive at the PCA procedure below (the procedure mainly follows Professor Zhou Zhihua's book "Machine Learning", with the objectives and assumptions listed for comparison with the later algorithms; Professor Zhou derives it with the Lagrange multiplier method, which is essentially the same as [3]. I highly recommend that blog post on the mathematical principles of PCA [3]).
- Input: N D-dimensional vectors $x_1, \dots, x_N$, reduced to d dimensions
- Output: Projection matrix $W = (w_1, \dots, w_d)$, where each $w_i$ is a D-dimensional column vector
- Goal: Make the projected data as spread out as possible, i.e. maximize $\mathrm{tr}(W^T C W)$ (the trace appears because the off-diagonal entries of B discussed above are all 0, while the diagonal entries are exactly the variances of each dimension)
- Assumption: The variance of the reduced data in each dimension is as large as possible, and the dimensions are orthogonal
- 1. Center each dimension of the input to have mean 0
- 2. Compute the covariance matrix $C = \frac{1}{N} X X^T$ of the input
- 3. Perform eigenvalue decomposition on the covariance matrix C
- 4. Take the eigenvectors corresponding to the d largest eigenvalues to form W
Additionally, PCA has many variants such as kernel PCA, probabilistic PCA, etc. This article will only consider the simplest version of PCA for now.
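To make the steps concrete, here is a minimal NumPy sketch of plain PCA under the conventions above (X stored as D×N; the function and variable names are my own, not from the article):

```python
import numpy as np

def pca(X, d):
    """Plain PCA. X has shape (D, N): one D-dimensional sample per column."""
    # 1. Center each dimension to zero mean
    Xc = X - X.mean(axis=1, keepdims=True)
    # 2. Covariance matrix C (D x D)
    C = Xc @ Xc.T / Xc.shape[1]
    # 3. Eigendecomposition of the symmetric matrix C
    eigvals, eigvecs = np.linalg.eigh(C)
    # 4. Keep the eigenvectors of the d largest eigenvalues as W (D x d)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:d]]
    Z = W.T @ Xc          # reduced data, shape (d, N)
    return W, Z
```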
3.2 Multidimensional Scaling (MDS)
The goal of MDS is to preserve the dissimilarities between data points during dimensionality reduction, which can be understood as keeping the distance relationships of the high-dimensional space unchanged in the low-dimensional space. The distances are stored in a matrix: entry $a_{ij}$ of A is the distance between samples i and j, and the distances in the low-dimensional space are assumed to be Euclidean. Writing the reduced data as $Z = (z_1, \dots, z_N)$, we have $a_{ij}^2 = \|z_i - z_j\|^2 = z_i^T z_i + z_j^T z_j - 2 z_i^T z_j$. The inner products on the right-hand side are collected in the inner product matrix E with $e_{ij} = z_i^T z_j$. After centering, each row and column of E sums to 0, which leads to $e_{ij} = -\frac{1}{2}\left(a_{ij}^2 - a_{i\bullet}^2 - a_{\bullet j}^2 + a_{\bullet\bullet}^2\right)$, or in matrix form $E = -\frac{1}{2} J A^{(2)} J$, where J is the identity matrix I minus $\frac{1}{N}$ times the all-ones matrix, $A^{(2)}$ is the matrix of squared distances, and the subscripts $i\bullet$ and $\bullet j$ denote averaging the squared distances over a row or a column ($\bullet\bullet$ over the whole matrix). This establishes the relationship between the distance matrix A and the inner product matrix E: given A we can solve for E, and an eigenvalue decomposition $E = V \Lambda V^T$, where $\Lambda$ is the diagonal matrix of eigenvalues $\lambda_1 \ge \dots \ge \lambda_d$, lets the data be represented as $Z = \Lambda^{1/2} V^T$. Selecting the d largest eigenvalues makes the distance matrix in the d-dimensional space approximate the distance matrix of the original D-dimensional space. The MDS procedure is as follows [2]:
- Input: Distance matrix $A^{N\times N}$, where the superscript gives the size of the matrix; the original data is D-dimensional and is reduced to d dimensions
- Output: Reduced matrix $Z^{d\times N}$
- Goal: Keep the relative relationships between data points unchanged during dimensionality reduction
- Assumption: The distance matrix of the N samples is known
- 1. Compute $a_{i\bullet}^2$, $a_{\bullet j}^2$, $a_{\bullet\bullet}^2$
- 2. Compute the inner product matrix E
- 3. Perform eigenvalue decomposition on E
- 4. Take the d largest eigenvalues $\lambda_1, \dots, \lambda_d$ and arrange the corresponding eigenvectors in order to construct $Z = \Lambda^{1/2} V^T$
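A minimal NumPy sketch of classical MDS following these steps (A is assumed to hold Euclidean distances; negative eigenvalues are simply clipped; names are my own):

```python
import numpy as np

def mds(A, d):
    """Classical MDS. A is the N x N matrix of pairwise distances."""
    N = A.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N        # centering matrix
    E = -0.5 * J @ (A ** 2) @ J                # inner product matrix
    eigvals, eigvecs = np.linalg.eigh(E)
    idx = np.argsort(eigvals)[::-1][:d]        # d largest eigenvalues
    L = np.diag(np.sqrt(np.maximum(eigvals[idx], 0)))  # clip negatives before sqrt
    Z = L @ eigvecs[:, idx].T                  # reduced data, shape (d, N)
    return Z
```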
3.3 Linear Discriminant Analysis (LDA)
LDA was originally proposed by Fisher in 1936 for the binary classification problem. Because its computation effectively reduces the dimensionality of the data, it can also be used for supervised linear dimensionality reduction: it projects high-dimensional data into a low-dimensional space in a way that makes it easy to tell which class each sample belongs to. Here we consider the case of K classes. The goal is to separate the samples into the K classes as well as possible, which is reflected in projections of same-class samples being as close together as possible and projections of different classes being as far apart as possible. This differs from PCA: PCA wants all samples to be spread out as much as possible along each dimension, whereas LDA is content for projections of same-class samples to crowd together or even overlap, which PCA would avoid. The dimensionality reduction itself is similar to PCA, a linear mapping by matrix multiplication, with projected points $z_i = W^T x_i$. Suppose the projection direction is the one shown in the figure below, and the projected class centers correspond to the original high-dimensional centers $\mu_1, \mu_2$. Since we want samples of different classes to be as far apart as possible, we want the projected centers to be as far apart as possible, i.e. to maximize $\|W^T\mu_1 - W^T\mu_2\|$. But pushing the centers apart is not enough: if, for example, we project perpendicular to the $x_1$ axis, the two centers may be far enough apart while some samples still overlap in the projected space. So there is a second objective: the projections of same-class samples should be as close together as possible, i.e. the within-class covariance of each class should be minimized. The covariance matrix of the samples of one class is given below.
Figure 3 LDA Projection (Image Source [4])
Here $X_1$ denotes the set of samples belonging to class 1, $\mu_1$ is their mean, and the matrix $\Sigma_1 = \sum_{x\in X_1}(x-\mu_1)(x-\mu_1)^T$ is the covariance (scatter) matrix of the original data of class 1; summing such matrices over all classes gives the within-class scatter matrix $S_w = \sum_{k=1}^{K}\sum_{x\in X_k}(x-\mu_k)(x-\mu_k)^T$. For two classes the distance between centers can be measured by subtracting the centers directly; for K classes we need the overall sample mean $\mu = \frac{1}{N}\sum_i x_i$ ($N_k$ denotes the number of samples in class k), and the between-class scatter matrix is $S_b = \sum_{k=1}^{K} N_k (\mu_k-\mu)(\mu_k-\mu)^T$. Putting the two together, the optimization objective of LDA is to maximize the cost function $J(W) = \frac{\mathrm{tr}(W^T S_b W)}{\mathrm{tr}(W^T S_w W)}$.

In the binary case W has size D×1, so $J(W)$ is itself a scalar. In the K-class case W has size D×(K−1), and the objective is the ratio of the traces of the two matrices. Note that LDA does not center the data: centering each class on its own mean would make the class centers coincide, so no centering is performed. Differentiating $J(W)$ with respect to W shows that the solution W consists of the eigenvectors of $S_w^{-1} S_b$ associated with the K−1 largest eigenvalues. This W can then be used to reduce X.
- Input: N D-dimensional vectors $x_1, \dots, x_N$, divided into K classes
- Output: Projection matrix $W = (w_1, \dots, w_{K-1})$, where each $w_i$ is a D-dimensional column vector
- Goal: Minimize the within-class covariance of same-class samples after projection, while maximizing the distance between the centers of different classes
- Assumption: The optimization objective is to maximize $J(W) = \mathrm{tr}(W^T S_b W)\,/\,\mathrm{tr}(W^T S_w W)$
- 1. Compute the within-class scatter matrix $S_w$ and the between-class scatter matrix $S_b$
- 2. Perform singular value decomposition on $S_w$, obtaining $S_w^{-1}$
- 3. Perform eigenvalue decomposition on the matrix $S_w^{-1} S_b$
- 4. Take the eigenvectors corresponding to the K−1 largest eigenvalues to form W
The optimization objective here also embodies an assumption: both the numerator $W^T S_b W$ and the denominator $W^T S_w W$ are taken to be diagonal, i.e. W simultaneously diagonalizes $S_b$ and $S_w$.
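A minimal NumPy sketch of this procedure (a pseudo-inverse of $S_w$ is used instead of the SVD step; function and variable names are my own):

```python
import numpy as np

def lda(X, y, d):
    """LDA projection. X: (D, N) data, y: (N,) integer class labels, d: output dim (<= K-1)."""
    D, N = X.shape
    mu = X.mean(axis=1, keepdims=True)                # overall mean
    Sw = np.zeros((D, D))
    Sb = np.zeros((D, D))
    for k in np.unique(y):
        Xk = X[:, y == k]
        mk = Xk.mean(axis=1, keepdims=True)           # class mean
        Sw += (Xk - mk) @ (Xk - mk).T                 # within-class scatter
        Sb += Xk.shape[1] * (mk - mu) @ (mk - mu).T   # between-class scatter
    # eigenvectors of Sw^{-1} Sb with the largest eigenvalues
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    idx = np.argsort(eigvals.real)[::-1][:d]
    W = eigvecs[:, idx].real                          # (D, d) projection matrix
    return W, W.T @ X
```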
3.4 Isometric Mapping (Isomap)
As mentioned above, MDS only reduces the dimensionality of the data and requires the high-dimensional distance relationships to be known; it does not reflect the underlying manifold of the high-dimensional data. We can combine the basic idea of manifold learning with MDS to perform the reduction [5]: distances within a local region of the high-dimensional space can be computed with the Euclidean metric. For the distance matrix A used by MDS, the entry for two neighboring points is their Euclidean distance, $A_{ij} = \|x_i - x_j\|$, while the entry for two non-neighboring points is initialized to $A_{ij} = \infty$ and later filled in by a shortest-path algorithm over the neighborhood graph. To build A we must decide which points are neighbors, and Isomap uses KNN to determine the neighboring points. The overall algorithm is as follows:
- Input: N D-dimensional vectors $x_1, \dots, x_N$, each point having K neighbors, mapped to d dimensions
- Output: Reduced matrix Z
- Goal: Preserve the manifold structure of the high-dimensional data during dimensionality reduction
- Assumption: Distances between points within a local region of the high-dimensional space can be computed with the Euclidean metric
- 1. Use KNN to build part of A: find each point's neighbors and fill in their Euclidean distances, initializing all other entries to infinity
- 2. Use a shortest-path algorithm (Dijkstra's algorithm) to find the paths between the remaining pairs of points and fill in their distances
- 3. Feed the distance matrix A to MDS to obtain the output Z
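A minimal sketch combining these steps, reusing the `mds` function from the MDS section above and SciPy's `shortest_path` for the Dijkstra step (names are my own):

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist

def isomap(X, k, d):
    """Isomap sketch. X: (D, N) data, k: number of neighbors, d: output dim."""
    N = X.shape[1]
    dist = cdist(X.T, X.T)                       # Euclidean distances, (N, N)
    A = np.full((N, N), np.inf)                  # inf marks "not neighbors"
    nbrs = np.argsort(dist, axis=1)[:, 1:k + 1]  # k nearest neighbors of each point
    for i in range(N):
        A[i, nbrs[i]] = dist[i, nbrs[i]]
        A[nbrs[i], i] = dist[i, nbrs[i]]         # keep the neighborhood graph symmetric
    G = shortest_path(A, method="D")             # Dijkstra over the kNN graph
    return mds(G, d)                             # classical MDS on the geodesic distances
```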
3.5 Locally Linear Embedding (LLE)
As mentioned above, local regions in manifold learning are assumed to behave like Euclidean space. LLE assumes that the coordinates of a point $x_i$ can be written as a linear combination of its surrounding points, $x_i = \sum_{j\in X_i} f_{ij}\, x_j$ (where $X_i$ denotes the set of neighbors of $x_i$), which is a relationship expressed in the high-dimensional space. Since this relationship should be preserved in the low-dimensional space, we also require $z_i = \sum_{j\in X_i} f_{ij}\, z_j$, with the same weights appearing in both equations.

Based on this assumption, we first need to solve for the weights. Assuming each sample point is reconstructed from its K surrounding samples, the weight vector of one sample has size 1×K. The weights are found by minimizing the reconstruction error, $\min_F \sum_i \|x_i - \sum_{j\in X_i} f_{ij} x_j\|^2$, and the objective is differentiated with respect to f to obtain the weights in closed form. Once the weights are known, they are substituted into the low-dimensional objective $\min_Z \sum_i \|z_i - \sum_{j\in X_i} f_{ij} z_j\|^2$ to solve for Z. Here F collects the weights (one row of K weights per sample, i.e. N×K, or equivalently a sparse N×N matrix) and constraints are placed on Z; using the Lagrange multiplier method the problem takes the form of an eigenproblem $Mz = \lambda z$, so Z is obtained through the eigenvalue decomposition of M.
- Input: N D-dimensional vectors $x_1, \dots, x_N$, each point having K neighbors, mapped to d dimensions
- Output: Reduced matrix Z
- Goal: Preserve the manifold structure of the high-dimensional data during dimensionality reduction
- Assumption: A point in high-dimensional space is a linear combination of its K neighboring points, and the dimensions of the low-dimensional space are orthogonal
- 1. Use KNN to find the K neighbors of each point, then compute the matrices F and M
- 2. Perform eigenvalue decomposition on M
- 3. Take the eigenvectors corresponding to the d smallest non-zero eigenvalues to construct Z (the smallest eigenvalues are selected because the objective is a minimization)
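A minimal NumPy sketch of these steps, with $M = (I-F)^T(I-F)$; the small ridge term added to the local Gram matrix is a common stabilization trick rather than something from the article, and the names are my own:

```python
import numpy as np
from scipy.spatial.distance import cdist

def lle(X, k, d):
    """LLE sketch. X: (D, N) data, k: neighbors per point, d: output dim."""
    D, N = X.shape
    dist = cdist(X.T, X.T)
    neighbors = np.argsort(dist, axis=1)[:, 1:k + 1]
    F = np.zeros((N, N))                      # reconstruction weights
    for i in range(N):
        Ni = neighbors[i]
        G = X[:, Ni] - X[:, [i]]              # local difference vectors (D, k)
        C = G.T @ G + 1e-3 * np.eye(k)        # regularized local Gram matrix
        w = np.linalg.solve(C, np.ones(k))
        F[i, Ni] = w / w.sum()                # weights of each row sum to 1
    M = (np.eye(N) - F).T @ (np.eye(N) - F)
    eigvals, eigvecs = np.linalg.eigh(M)
    # discard the smallest (near-zero) eigenvalue, keep the next d
    Z = eigvecs[:, 1:d + 1].T                 # reduced data, shape (d, N)
    return Z
```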
3.6 t-SNE
t-SNE is another method for reducing high-dimensional data to a two- or three-dimensional space. It was proposed by Maaten in 2008 [6] and builds on the Stochastic Neighbor Embedding (SNE) method proposed by Hinton in 2002. The main idea is to assume that, seen from any point $x_i$ in high-dimensional space, the other points $x_j$ follow a Gaussian distribution centered at $x_i$ with variance $\sigma_i$, and symmetrically that $x_i$ follows a Gaussian centered at $x_j$ with variance $\sigma_j$. The conditional probability expressing the similarity of $x_j$ to $x_i$ is therefore defined as $p_{j|i} = \frac{\exp(-\|x_i-x_j\|^2/2\sigma_i^2)}{\sum_{k\neq i}\exp(-\|x_i-x_k\|^2/2\sigma_i^2)}$, i.e. the probability of $x_j$ under the Gaussian centered at $x_i$ as a fraction of the total probability of all samples under that Gaussian; it reflects the similarity of the two points from the viewpoint of $x_i$. The joint probability $p_{ij} = (p_{j|i} + p_{i|j})/2N$ is then used as the similarity of the pair among all pairwise similarities of the samples. (One could instead define the joint probability directly as $p_{ij} = \frac{\exp(-\|x_i-x_j\|^2/2\sigma^2)}{\sum_{k\neq l}\exp(-\|x_k-x_l\|^2/2\sigma^2)}$; that formula does not make clear whether $\sigma$ is a scalar or a vector, and in the actual solution $p_{ij}$ is not obtained from it but from the conditional probabilities above, with a specific $\sigma_i$ determined for each sample i.)
Likewise, the relationship (similarity) between two points in the low-dimensional space is represented by a joint probability, under the assumption that the Euclidean distances in the low-dimensional space follow a t-distribution with one degree of freedom; the probability of a pair among all pairwise distances is $q_{ij} = \frac{(1+\|z_i-z_j\|^2)^{-1}}{\sum_{k\neq l}(1+\|z_k-z_l\|^2)^{-1}}$.

If the similarity $q_{ij}$ computed from the low-dimensional points $z_i$ and $z_j$ matches the similarity $p_{ij}$ computed from the high-dimensional points $x_i$ and $x_j$, then the low-dimensional points accurately reflect the relative positions in the high-dimensional space. The goal of t-SNE is therefore to find a reduced representation that makes $p_{ij}$ and $q_{ij}$ as close as possible, so it builds its objective from the Kullback-Leibler (KL) divergence, which measures the difference between two probability distributions: $J = \mathrm{KL}(P\|Q) = \sum_i\sum_{j\neq i} p_{ij}\log\frac{p_{ij}}{q_{ij}}$. Gradient descent is then used to find the low-dimensional representation $z_i$ of each input: the objective is differentiated with respect to $z_i$, treating the $z_i$ as the variables to optimize, giving a gradient with which each $z_i$ is iteratively updated. In the actual update a momentum term is added to speed up the optimization. The general procedure is as follows:
- Input: N D-dimensional vectors $x_1, \dots, x_N$, mapped to two or three dimensions, with fixed perplexity Perp, number of iterations T, learning rate $\eta$, and momentum coefficient $\alpha(t)$
- Output: Reduced data representation $z_1, \dots, z_N$
- Goal: Reduce to two or three dimensions for visualization (the emphasis is on visualization)
- Assumption: In high-dimensional space, the values of points $x_j$ around a point $x_i$ follow a Gaussian distribution centered at $x_i$; in low-dimensional space, the Euclidean distances between points follow a t-distribution with one degree of freedom
- 1. For each $x_i$, determine $\sigma_i$ by binary search so that the conditional distribution matches the perplexity Perp
- 2. Compute all pairwise $p_{j|i}$, and from them $p_{ij} = (p_{j|i} + p_{i|j})/2N$
- 3. Initialize $z_1, \dots, z_N$
- 4. Compute $q_{ij}$
- 5. Compute the gradient $\partial J/\partial z_i$
- 6. Update $z_i^{(t)} = z_i^{(t-1)} - \eta\,\partial J/\partial z_i + \alpha(t)\,(z_i^{(t-1)} - z_i^{(t-2)})$
- 7. Repeat steps 4-6 until convergence or until the number of iterations T is reached
Note that this algorithm treats the low-dimensional data as the variables being optimized. If new data arrives, the method cannot be applied to it directly; the new data has to be appended to the original data and the whole computation redone. For this reason the main use of t-SNE is still visualization.
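For reference, a very small sketch of the procedure (no early exaggeration or other tricks from the paper, and a fixed momentum coefficient instead of $\alpha(t)$; names and default values are my own):

```python
import numpy as np
from scipy.spatial.distance import cdist

def tsne(X, d=2, perplexity=30.0, T=500, eta=100.0, alpha=0.5):
    """Bare-bones t-SNE sketch. X: (N, D) data, d: output dimension."""
    N = X.shape[0]
    D2 = cdist(X, X) ** 2

    # Steps 1-2: binary-search beta_i = 1/(2 sigma_i^2) to match the target perplexity,
    # then symmetrize the conditional probabilities into p_ij.
    P = np.zeros((N, N))
    target = np.log(perplexity)                  # desired Shannon entropy (in nats)
    for i in range(N):
        lo, hi, beta = 0.0, np.inf, 1.0
        for _ in range(50):
            p = np.exp(-D2[i] * beta)
            p[i] = 0.0
            p /= max(p.sum(), 1e-12)
            H = -np.sum(p[p > 0] * np.log(p[p > 0]))
            if H > target:                       # too flat -> increase beta (shrink sigma)
                lo, beta = beta, beta * 2 if np.isinf(hi) else (beta + hi) / 2
            else:                                # too peaked -> decrease beta
                hi, beta = beta, (beta + lo) / 2
        P[i] = p
    P = (P + P.T) / (2 * N)

    # Steps 3-7: gradient descent on KL(P || Q) with a momentum term.
    rng = np.random.default_rng(0)
    Z = 1e-4 * rng.standard_normal((N, d))
    Z_prev = Z.copy()
    for _ in range(T):
        num = 1.0 / (1.0 + cdist(Z, Z) ** 2)     # (1 + ||z_i - z_j||^2)^{-1}
        np.fill_diagonal(num, 0.0)
        Q = num / num.sum()
        G = (P - Q) * num
        grad = 4.0 * (np.diag(G.sum(axis=1)) - G) @ Z
        Z, Z_prev = Z - eta * grad + alpha * (Z - Z_prev), Z
    return Z
```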
3.7 Deep Autoencoder Networks
An autoencoder is a neural network, an unsupervised algorithm that can be used for dimensionality reduction and also for automatically learning features from data. The idea is that the network takes a set of input values and, after passing through the network, produces a set of outputs that should be as close as possible to the inputs. The network consists of fully connected layers, where each node of a layer is connected to all nodes of the previous layer. The structure of the autoencoder is shown in Figure 4: the encoder network is a standard feed-forward network, $z = Wx + b$, and the decoder network uses the transposed parameters of the symmetric encoder layers, $x' = W^T z + b'$. When the signal has propagated back to a layer with the same number of nodes as the input, we obtain a set of values $x'$ that the network wants to match the input, $x' = x$. The reconstruction error is measured with cross-entropy or mean squared error to form the cost function, which is minimized by gradient descent to learn the network parameters. In this way the network can project high-dimensional data into a low-dimensional space with the "encoder" and map the low-dimensional data back to the high-dimensional space with the "decoder".
Figure 4 Autoencoder Network Structure
In the actual implementation, however, the network consists of only half of the layers shown in Figure 4, e.g. a four-layer fully connected structure of 2000-1000-500-30, because the weight parameters are shared between the encoder and the decoder. The encoder multiplies the node values of the previous layer by the weights to obtain the values of the current layer, while the decoder multiplies the values of the current layer by the transposed weight matrix to recover the values of the previous layer. The image below [7] clearly shows the actual structure of each layer, including one forward and one backward propagation pass. The values at the top layer can then be used as the network's dimensionality-reduced output for further analysis, such as visualization or as compressed features.
Figure 5 Layer Structure of Autoencoder
In 2006 Hinton published an article in Science on using a deep autoencoder network for dimensionality reduction [8], whose main contribution is to pre-train the weights with a stack of RBMs, addressing the fact that the quality of the reduction depends heavily on the initialization of the network weights. The expressions above contain no nonlinearity; in a real network every layer applies a nonlinear transformation after the multiplication with the weights. The autoencoder can also be given a sparsity property [9]: for N D-dimensional inputs, the average activation of a given node in a given layer should approach 0, i.e. $\frac{1}{N}\sum_{i=1}^{N} a_j^{(l)}(x_i) \approx 0$, where l indicates the layer and i indexes the input. Regularization terms can also be added to the weights.
- Input: N D-dimensional vectors $x_1, \dots, x_N$, a network structure giving the number of nodes in each layer, number of iterations T, learning rate $\eta$
- Output: Reduced data representation $z_1, \dots, z_N$
- Goal: The network learns certain properties or structures of the data that allow the input to be reconstructed
- Assumption: The neural network is powerful enough to learn such features, haha
- 1. Set the number of layers and the number of nodes in each layer
- 2. Initialize the weight parameters
- 3. Forward propagation: compute the node values of the next layer, $z = Wx + b$
- 4. Backward pass through the decoder: compute the node values of the previous layer, $x' = W^T z + b'$
- 5. Compute the gradient of the reconstruction error with respect to each layer's inputs and its parameters W, propagating the error through the network with backpropagation
- 6. Combine the gradients for W coming from the encoder and decoder passes and then update W (and the biases)
- 7. Repeat steps 3-6 until convergence or until the number of iterations T is reached
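A minimal sketch of a single-hidden-layer autoencoder with tied (shared) weights, trained with mean squared error; the deep, RBM-pretrained network of [8] is much more involved, and the function and variable names here are my own:

```python
import numpy as np

def train_autoencoder(X, d, T=1000, eta=0.01, seed=0):
    """Tied-weight autoencoder with one hidden (code) layer.
    X: (D, N) data, d: size of the code layer."""
    rng = np.random.default_rng(seed)
    D, N = X.shape
    W = rng.normal(0.0, 0.01, (d, D))   # shared encoder/decoder weights
    b = np.zeros((d, 1))                # encoder bias
    c = np.zeros((D, 1))                # decoder bias
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    for _ in range(T):
        Z = sigmoid(W @ X + b)          # encoder: z = f(Wx + b)
        Xr = W.T @ Z + c                # decoder uses the transposed weights
        E = Xr - X                      # reconstruction error
        dZ = (W @ E) * Z * (1 - Z)      # backprop through the code layer
        gW = (dZ @ X.T + Z @ E.T) / N   # tied weights collect both contributions
        W -= eta * gW
        b -= eta * dZ.mean(axis=1, keepdims=True)
        c -= eta * E.mean(axis=1, keepdims=True)
    Z = sigmoid(W @ X + b)              # the code layer is the reduced representation
    return W, b, Z
```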
4. Summary
This article has focused mainly on the procedures of the algorithms, describing what each step does; some of the theoretical explanations may not be entirely clear. Interestingly, apart from t-SNE and the autoencoder, all of these dimensionality reduction algorithms boil down to constructing some matrix and then performing an eigenvalue decomposition on it to obtain the relevant Z or W. I have not studied Laplacian Eigenmaps carefully, but interestingly that algorithm also ends up selecting the d smallest non-zero eigenvalues. For someone with a weaker mathematical background, this raises the question of why eigenvalue-based methods give such good results. Comparing a single-layer autoencoder with PCA: if the autoencoder's objective is to minimize mean squared error, then although it does not impose constraints as strong as PCA's (which requires orthogonality), it can still learn well, because maximizing the trace of the covariance is equivalent to minimizing the mean squared reconstruction error. These methods often seem to share some underlying connections, which makes one wonder whether a unified model could be extracted to solve the dimensionality reduction problem.
5. Code Appendix
There is a wealth of material available online about the various dimensionality reduction algorithms, although most of it does not come with source code. A GitHub project collects Python implementations of 11 classic dimensionality reduction algorithms, including PCA, LDA, MDS, LLE, t-SNE, and others, along with related materials and demonstrations; it is well suited to beginners in machine learning and data mining.