
Source: Algorithm Advancement
It provides an information table of seven algorithms, summarizing the parameters and main objectives of each algorithm, and introduces some basic concepts of dimensionality reduction.
- 3.1 Principal Component Analysis (PCA)
- 3.2 Multidimensional Scaling (MDS)
- 3.3 Linear Discriminant Analysis (LDA)
- 3.4 Isometric Mapping (Isomap)
- 3.5 Locally Linear Embedding (LLE)
- 3.6 t-SNE
- 3.7 Deep Autoencoder Networks

3.1 Principal Component Analysis (PCA)
PCA was invented by Karl Pearson in 1901 and is a linear dimensionality reduction method. A point in high-dimensional space (with dimension D) is mapped to a point in low-dimensional space (with dimension d, d < D) by multiplying with a matrix W, i.e. $z_i = W^T x_i$, where W has dimensions D×d and i indexes the i-th sample point. Thus we obtain N points mapped from D-dimensional space to d-dimensional space. The goal of PCA is to separate the mapped points as much as possible, meaning that the variance of the N points $z_1, \dots, z_N$ should be maximized. If the mean of each dimension in the D-dimensional space is 0, that is, $\sum_{i=1}^{N} x_i = 0$, then multiplying both sides by $W^T$ shows that the reduced data also has mean 0. Consider the matrix $C = \frac{1}{N} X X^T$ with $X = (x_1, \dots, x_N)$: this is the covariance matrix of the D-dimensional data. Its diagonal entries are the variances of each of the D dimensions, while the off-diagonal entries are the covariances between pairs of dimensions.
Now consider the covariance matrix B of the reduced d-dimensional data. If we want the points after reduction to be as separated as possible, the diagonal entries of B, which are the variances of each dimension, should be as large as possible; a larger variance indicates that the data in that dimension has good discriminability. We also want the d dimensions to be orthogonal: orthogonality ensures that two dimensions are uncorrelated and carry no overlapping information, so that the reduced representation describes the data best, with every dimension having sufficient discriminability and distinct information. In that case all off-diagonal entries of B should be zero. It can also be derived that
$$B = \frac{1}{N} Z Z^T = \frac{1}{N} W^T X X^T W = W^T C W.$$
This equation indicates that the role of the linear transformation matrix W in the PCA algorithm is to diagonalize the original covariance matrix C. Since diagonalization in linear algebra is achieved by solving for eigenvalues and the corresponding eigenvectors, we can derive the PCA algorithm below (the procedure is mainly taken from Professor Zhou Zhihua's book "Machine Learning", with the Goal and Hypothesis added for comparison with the subsequent algorithms; the derivation in Professor Zhou's book uses the Lagrange multiplier method, which is essentially the same as [3]. I highly recommend the blog post [3] on the mathematical principles of PCA).
- Input: N D-dimensional vectors $x_1, \dots, x_N$, reduced to d dimensions
- Output: Projection matrix $W = (w_1, \dots, w_d)$, where each $w_i$ is a D-dimensional column vector
- Goal: To separate the data as much as possible after projection, i.e. maximize $\operatorname{tr}(W^T C W)$ (the trace appears because, as noted above, the off-diagonal elements of B are all 0, while the diagonal elements are precisely the variances of each dimension)
- Hypothesis: The variance of each dimension of the reduced data is as large as possible, and the dimensions are mutually orthogonal

1. Center each dimension of the input to 0
2. Calculate the covariance matrix $C = \frac{1}{N} X X^T$ of the input
3. Perform eigenvalue decomposition on the covariance matrix C
4. Take the top d eigenvalues and their corresponding eigenvectors to form $W = (w_1, \dots, w_d)$
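To make the four steps above concrete, here is a minimal NumPy sketch of this plain version of PCA (the helper name `pca` is mine, and the code uses the rows-as-samples convention, X of shape N×D):

```python
import numpy as np

def pca(X, d):
    """Minimal PCA sketch. X has shape (N, D); returns (W, Z) with Z of shape (N, d)."""
    # 1. Center each dimension to zero mean
    Xc = X - X.mean(axis=0)
    # 2. Covariance matrix C (D x D)
    C = Xc.T @ Xc / Xc.shape[0]
    # 3. Eigenvalue decomposition (C is symmetric, so eigh is appropriate)
    eigvals, eigvecs = np.linalg.eigh(C)
    # 4. Keep the eigenvectors of the d largest eigenvalues as the projection matrix W
    order = np.argsort(eigvals)[::-1][:d]
    W = eigvecs[:, order]              # shape (D, d)
    return W, Xc @ W                   # projected data z_i = W^T x_i

# Example: reduce 100 random 5-D points to 2-D
W, Z = pca(np.random.randn(100, 5), d=2)
```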
In addition, PCA has many variants such as kernel PCA, probabilistic PCA, etc. This article only considers the simplest version of PCA for now.
3.2 Multidimensional Scaling (MDS)
The goal of MDS is to preserve the dissimilarity of the data during dimensionality reduction, which can also be understood as keeping the distance relationships of high-dimensional space unchanged in low-dimensional space. The distances are represented by a matrix A, where entry $a_{ij}$ is the distance between samples i and j among the N samples, and it is assumed that the distances in low-dimensional space are Euclidean distances. The reduced data is represented as $Z = (z_1, \dots, z_N) \in \mathbb{R}^{d \times N}$, thus we have
$$a_{ij}^2 = \|z_i - z_j\|^2 = \|z_i\|^2 + \|z_j\|^2 - 2 z_i^T z_j.$$
The three terms on the right are uniformly represented by the inner product matrix $E = Z^T Z$, with $e_{ij} = z_i^T z_j$. After centering, each row and column sum of E is 0, which leads to
$$e_{ij} = -\frac{1}{2}\left(a_{ij}^2 - a_{i\cdot}^2 - a_{\cdot j}^2 + a_{\cdot\cdot}^2\right),$$
or, in matrix form, $E = -\frac{1}{2} H A^{(2)} H$, where $A^{(2)}$ is the matrix of squared distances and the centering matrix H is the identity matrix I minus the all-ones matrix divided by N. The subscripts $i\cdot$ and $\cdot j$ denote averaging over a row or column of the squared distance matrix, i.e. $a_{i\cdot}^2 = \frac{1}{N}\sum_j a_{ij}^2$, $a_{\cdot j}^2 = \frac{1}{N}\sum_i a_{ij}^2$, $a_{\cdot\cdot}^2 = \frac{1}{N^2}\sum_{i,j} a_{ij}^2$. This establishes the relationship between the distance matrix A and the inner product matrix E: knowing A allows us to solve for E, and by performing eigenvalue decomposition on E we can represent the data under all eigenvalues as $Z = \Lambda^{1/2} V^T$. When only the d largest eigenvalues are selected, the distance matrix in d-dimensional space approximates the one in the high-dimensional space D. The MDS process is as follows [2]:
- Input: Distance matrix $A \in \mathbb{R}^{N \times N}$ (the superscript indicates the matrix size); the original data is D-dimensional and is reduced to d dimensions
- Output: Reduced matrix $Z \in \mathbb{R}^{d \times N}$
- Goal: To maintain the relative relationships between the data during dimensionality reduction
- Hypothesis: The distance matrix of the N samples is known

1. Calculate $a_{i\cdot}^2$, $a_{\cdot j}^2$, $a_{\cdot\cdot}^2$
2. Calculate the inner product matrix E
3. Perform eigenvalue decomposition on E
4. Take the d largest eigenvalues to form $\Lambda$, with the corresponding eigenvectors arranged in order to form V; the output is $Z = \Lambda^{1/2} V^T$
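Under these assumptions, classical MDS can be sketched in a few lines of NumPy (the function name `classical_mds` is mine; rows of the returned Z are samples):

```python
import numpy as np

def classical_mds(A, d):
    """Classical MDS sketch. A is an (N, N) distance matrix; returns Z of shape (N, d)."""
    N = A.shape[0]
    A2 = A ** 2                                   # squared distances
    H = np.eye(N) - np.ones((N, N)) / N           # centering matrix: I minus the all-ones matrix / N
    E = -0.5 * H @ A2 @ H                         # inner product (Gram) matrix
    eigvals, eigvecs = np.linalg.eigh(E)
    order = np.argsort(eigvals)[::-1][:d]         # d largest eigenvalues
    L = np.maximum(eigvals[order], 0)             # guard against tiny negative eigenvalues
    return eigvecs[:, order] * np.sqrt(L)         # Z = V * Lambda^{1/2}

# Example: recover 2-D coordinates from pairwise Euclidean distances
X = np.random.randn(50, 2)
A = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Z = classical_mds(A, d=2)
```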
3.3 Linear Discriminant Analysis (LDA)
LDA was initially proposed by Fisher in 1936 to solve binary classification problems. Since its calculation effectively reduces the dimensionality of the data, it can also be used for supervised linear dimensionality reduction. It projects high-dimensional data onto a low-dimensional space in which the class of each sample is determined; here we consider the case of K classes. Its goal is to classify the samples as accurately as possible into the K classes, which is reflected in projection points of the same class being close together and projection points of different classes being far apart. This differs from PCA: PCA aims to separate all samples along each retained dimension, whereas LDA allows the low-dimensional projections of samples to overlap, which PCA does not. The dimensionality reduction itself is similar to PCA, a linear map by matrix multiplication, with projection points $z_i = W^T x_i$.
Consider first the two-class case with some projection direction. The projected class centers correspond to the original high-dimensional means $\mu_1$ and $\mu_2$. Since we want samples of different classes to be as far apart as possible, we want the projected centers to be as distant as possible, i.e. the objective is to maximize $\|W^T \mu_1 - W^T \mu_2\|_2^2$. But having the centers far apart is not enough: projecting perpendicular to the $x_1$ axis, for example, may place the centers far apart while some samples still overlap in the projection space. An additional optimization objective is therefore needed: the projection points of the same class should be as close as possible, i.e. the within-class scatter of same-class samples should be minimized. For class 1, the scatter of the projected samples is
$$\sum_{x \in X_1} \left(W^T x - W^T \mu_1\right)\left(W^T x - W^T \mu_1\right)^T = W^T \Sigma_1 W,$$
where $X_1$ represents the set of samples belonging to class 1, $\mu_1$ is its mean, W is the projection matrix, and the middle matrix $\Sigma_1 = \sum_{x \in X_1} (x - \mu_1)(x - \mu_1)^T$ is the covariance matrix of class 1. Summing over the classes gives the within-class scatter matrix of the original data for the K classes, $S_w = \sum_{k=1}^{K} \Sigma_k$. For two classes, the distance between the projected centers is obtained by subtracting the centers directly; for K classes we first compute the overall sample center $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$ and measure how far each class center lies from it, giving the between-class scatter matrix $S_b = \sum_{k=1}^{K} N_k (\mu_k - \mu)(\mu_k - \mu)^T$ (where $N_k$ indicates the number of samples belonging to class k). To summarize, the optimization objective of the LDA algorithm is to maximize the following cost function:
$$J(W) = \frac{\operatorname{tr}\left(W^T S_b W\right)}{\operatorname{tr}\left(W^T S_w W\right)}.$$
In the binary case W has dimensions D×1, so J(W) is a scalar; for K classes W has dimensions D×(K-1), and the objective is formulated with matrix traces. Note that LDA does not center the data: if each class were centered to its own mean, all class centers would coincide, which is why this algorithm performs no centering. Taking the derivative of J(W) with respect to W and setting it to zero shows that W is composed of the eigenvectors of $S_w^{-1} S_b$ corresponding to its K-1 largest eigenvalues. Thus, we can use W to reduce X.

- Input: N D-dimensional vectors $x_1, \dots, x_N$, divided into K classes
- Output: Projection matrix $W = (w_1, \dots, w_{K-1})$, where each $w_i$ is a D-dimensional column vector
- Goal: Minimize the covariance between samples of the same class after projection, while maximizing the distance between the centers of different classes
- Hypothesis: The optimization goal is to maximize $J(W) = \operatorname{tr}(W^T S_b W) / \operatorname{tr}(W^T S_w W)$

1. Calculate the within-class scatter matrix $S_w$ and the between-class scatter matrix $S_b$
2. Perform singular value decomposition on $S_w$ to obtain $S_w^{-1}$
3. Perform eigenvalue decomposition on the matrix $S_w^{-1} S_b$
4. Take the top K-1 eigenvalues and their corresponding eigenvectors $w_1, \dots, w_{K-1}$
I personally believe that the optimization objective here reflects a hypothesis similar to PCA's, namely that the matrices appearing in the numerator and denominator of the objective are diagonal, i.e. that the transformation W makes both $W^T S_b W$ and $W^T S_w W$ diagonal matrices.
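As a sanity check of the procedure, here is a minimal NumPy sketch of multi-class LDA under the formulation above (the function name `lda` is mine; in practice one would typically use sklearn's LinearDiscriminantAnalysis instead):

```python
import numpy as np

def lda(X, y, n_components=None):
    """Minimal LDA sketch. X: (N, D) samples, y: (N,) integer class labels.
    Returns the projection matrix W of shape (D, n_components)."""
    classes = np.unique(y)
    K, D = len(classes), X.shape[1]
    mu = X.mean(axis=0)                              # overall sample center
    Sw = np.zeros((D, D))
    Sb = np.zeros((D, D))
    for k in classes:
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        Sw += (Xk - mk).T @ (Xk - mk)                # within-class scatter
        diff = (mk - mu).reshape(-1, 1)
        Sb += len(Xk) * diff @ diff.T                # between-class scatter
    # Eigen-decompose Sw^{-1} Sb (pinv used for numerical robustness)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    d = n_components or (K - 1)                      # at most K-1 useful directions
    return eigvecs[:, order[:d]].real

# Example: project 3-class data onto K-1 = 2 discriminant directions
X = np.vstack([np.random.randn(30, 4) + c for c in range(3)])
y = np.repeat([0, 1, 2], 30)
Z = X @ lda(X, y)
```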
3.4 Isometric Mapping (Isomap)
The previously mentioned MDS only performs dimensionality reduction: it requires the distance relationships in high-dimensional space to be known and cannot reflect the potential manifold on which the high-dimensional data lies. However, the basic idea of manifold learning can be combined with MDS for dimensionality reduction [5]: locally, distances in high-dimensional space can be calculated with the Euclidean distance. For the distance matrix A used by MDS, the distance between two neighboring points is $A_{ij} = \|x_i - x_j\|$, i.e. their Euclidean distance; the distance between non-neighboring points is initialized to $A_{ij} = \infty$ and then filled in by a shortest-path algorithm running over the neighborhood graph, which approximates the geodesic distance along the manifold. Building A therefore requires deciding which points are neighbors; Isomap uses KNN to identify the neighboring points. The overall algorithm process is as follows:
- Input: N D-dimensional vectors $x_1, \dots, x_N$, each point has K neighboring points, mapped to d dimensions
- Output: Reduced matrix $Z \in \mathbb{R}^{d \times N}$
- Goal: To preserve the manifold structure of the high-dimensional data during dimensionality reduction
- Hypothesis: Local distances between points in high-dimensional space can be calculated using Euclidean distances

1. Use KNN to construct part of A, i.e. find the neighboring points of each sample and fill in their Euclidean distances $A_{ij}$, initializing the remaining entries to infinity
2. Use a shortest-path algorithm (e.g. Dijkstra's algorithm) to compute the path lengths between the remaining pairs of points and fill in those distances
3. Use the completed distance matrix A as the input to MDS and take its output
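A compact sketch of these three steps chains a KNN graph, a shortest-path solver, and the classical MDS from Section 3.2 (graph utilities from scikit-learn and SciPy; the wiring and the name `isomap` are mine, and a connected neighborhood graph is assumed):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, K=10, d=2):
    """Minimal Isomap sketch. X: (N, D) samples; returns an (N, d) embedding."""
    # 1. KNN graph: edges to the K nearest neighbors carry their Euclidean distances,
    #    all other pairs are left unconnected (effectively "infinite" distance)
    G = kneighbors_graph(X, n_neighbors=K, mode='distance')
    # 2. Shortest paths over the graph approximate geodesic distances on the manifold
    A = shortest_path(G, method='D', directed=False)      # 'D' = Dijkstra
    # 3. Classical MDS on the completed distance matrix (same steps as Section 3.2)
    N = A.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    E = -0.5 * H @ (A ** 2) @ H
    eigvals, eigvecs = np.linalg.eigh(E)
    order = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))

# Example usage on random data
Z = isomap(np.random.randn(300, 5), K=10, d=2)
```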
3.5 Locally Linear Embedding (LLE)
As previously mentioned, local regions of a manifold have Euclidean properties. LLE assumes that the coordinates of a point $x_i$ can be expressed as a linear combination of its surrounding points, that is, $x_i = \sum_{x_j \in X_i} w_{ij} x_j$ (where $X_i$ represents the set of points in the neighborhood of $x_i$); this is the representation in high-dimensional space. Since this relationship is assumed to be preserved in low-dimensional space as well, $z_i = \sum_{x_j \in X_i} w_{ij} z_j$, with the same weights appearing in both equations.
Based on the above assumption, the first step is to solve for these weights. Assuming each sample point is reconstructed from its K surrounding samples, the linear combination weights of one point form a 1×K vector, and the weights of all points form an N×K matrix F. The weights are obtained by minimizing the reconstruction error
$$\min_{F} \sum_{i=1}^{N} \Big\| x_i - \sum_{x_j \in X_i} w_{ij} x_j \Big\|^2 \quad \text{s.t.} \quad \sum_{x_j \in X_i} w_{ij} = 1,$$
which is solved in closed form with respect to F. After obtaining the weights, substitute them into the optimization objective for the low-dimensional space,
$$\min_{Z} \sum_{i=1}^{N} \Big\| z_i - \sum_{x_j \in X_i} w_{ij} z_j \Big\|^2,$$
and solve for Z, where F is arranged as N×K and constraints (such as $Z Z^T = I$) are applied to Z. The Lagrange multiplier method turns this into an eigenvalue problem of the form $M Z^T = \lambda Z^T$, where $M = (I - W)^T (I - W)$ and W is the N×N matrix obtained by placing the rows of F at their neighbor positions; Z is thus obtained through eigenvalue decomposition of M.
- Input: N D-dimensional vectors $x_1, \dots, x_N$, each point has K neighboring points, mapped to d dimensions
- Output: Reduced matrix Z
- Goal: To preserve the manifold structure of the high-dimensional data during dimensionality reduction
- Hypothesis: The coordinates of a point in high-dimensional space are a linear combination of its K neighbors, and the dimensions in low-dimensional space are orthogonal

1. Use KNN to find the K neighboring points of each sample, then calculate the matrices F and M
2. Perform eigenvalue decomposition on M
3. Take the d smallest non-zero eigenvalues and their corresponding eigenvectors to form Z (since the objective is a minimization, the smallest eigenvalues are taken)
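For concreteness, here is a minimal NumPy sketch of these three steps (the regularization term and the function name `lle` are my additions for numerical stability):

```python
import numpy as np

def lle(X, K=10, d=2, reg=1e-3):
    """Minimal LLE sketch. X: (N, D) samples; returns an (N, d) embedding."""
    N = X.shape[0]
    # 1. K nearest neighbours by Euclidean distance (excluding the point itself)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = np.argsort(dist, axis=1)[:, 1:K + 1]

    # 2. Reconstruction weights: solve a small regularized local Gram system per point
    W = np.zeros((N, N))
    for i in range(N):
        Ni = neighbors[i]
        G = (X[Ni] - X[i]) @ (X[Ni] - X[i]).T        # local Gram matrix (K, K)
        G += reg * np.trace(G) * np.eye(K)           # regularization for stability
        w = np.linalg.solve(G, np.ones(K))
        W[i, Ni] = w / w.sum()                       # weights of each point sum to 1

    # 3. Embedding: smallest eigenvectors of M = (I - W)^T (I - W)
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:d + 1]                          # skip the trivial constant eigenvector

# Example usage on random data
Z = lle(np.random.randn(200, 3), K=10, d=2)
```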
3.6 t-SNE
t-SNE is also a method for reducing high-dimensional data to two or three dimensions. It was proposed by Maaten in 2008 [6] and builds on the Stochastic Neighbor Embedding (SNE) method proposed by Hinton in 2002. The main idea is to model, for any two points in high-dimensional space, the neighbors $x_j$ of $x_i$ with a Gaussian distribution centered at $x_i$ with standard deviation $\sigma_i$, and similarly the neighbors $x_i$ of $x_j$ with a Gaussian centered at $x_j$ with standard deviation $\sigma_j$. The conditional probability that $x_i$ regards $x_j$ as similar is
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)},$$
i.e. the proportion of the probability of $x_j$ under the Gaussian centered at $x_i$ relative to the total probability of all other samples under that Gaussian, which reflects the similarity between the two points from the perspective of $x_i$. Then let $p_{ij} = (p_{j|i} + p_{i|j}) / (2N)$ be the joint probability describing the similarity between the two points among all samples. The paper does not state explicitly whether σ is a scalar or a vector, but since in the actual computation $p_{ij}$ is not obtained directly from a joint-probability formula but from the conditional probabilities above, a separate $\sigma_i$ is computed for each sample i. It is fixed by the user-specified perplexity
$$\mathrm{Perp}(P_i) = 2^{H(P_i)}, \quad H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i},$$
and binary search is used to find the $\sigma_i$ for each $x_i$ such that substituting it into the two equations above yields the chosen value of Perp. Therefore σ is effectively a vector; it is unlikely that all samples share a single Gaussian distribution parameter.
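The per-point binary search for $\sigma_i$ and the symmetrization into $p_{ij}$ can be sketched roughly as follows (a NumPy illustration; the helper name `conditional_probs` and the search bounds are my choices):

```python
import numpy as np

def conditional_probs(X, perplexity=30.0, tol=1e-5, max_iter=50):
    """Binary-search sigma_i per point so that the perplexity of P_{.|i}
    matches the target, then symmetrize to the joint p_ij."""
    N = X.shape[0]
    D2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)   # pairwise squared distances
    P = np.zeros((N, N))
    target_entropy = np.log2(perplexity)                     # Perp = 2^{H(P_i)}
    for i in range(N):
        lo, hi = 1e-20, 1e20                                 # search over beta = 1/(2*sigma_i^2)
        beta = 1.0
        for _ in range(max_iter):
            p = np.exp(-D2[i] * beta)
            p[i] = 0.0
            p /= p.sum()
            entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))  # H(P_i) in bits
            if abs(entropy - target_entropy) < tol:
                break
            if entropy > target_entropy:                     # too flat -> increase beta
                lo = beta
                beta = beta * 2 if hi == 1e20 else (beta + hi) / 2
            else:                                            # too peaked -> decrease beta
                hi = beta
                beta = (beta + lo) / 2
        P[i] = p
    return (P + P.T) / (2 * N)                               # joint probabilities p_ij

# Example usage on random data
P = conditional_probs(np.random.randn(100, 10), perplexity=20.0)
```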
At the same time, the mutual relationship, or similarity, between two points in low-dimensional space is also represented by a joint probability. It is assumed that the Euclidean distances between points in low-dimensional space follow a Student's t-distribution with one degree of freedom, so the similarity of $z_i$ and $z_j$ is the probability of their distance relative to all pairwise distances:
$$q_{ij} = \frac{\left(1 + \|z_i - z_j\|^2\right)^{-1}}{\sum_{k \neq l}\left(1 + \|z_k - z_l\|^2\right)^{-1}}.$$
If the similarity $p_{ij}$ computed in high-dimensional space for $x_i, x_j$ equals the similarity $q_{ij}$ computed in low-dimensional space for $z_i, z_j$, the points in low-dimensional space accurately reflect the relative positional relationships in high-dimensional space. Therefore, the goal of t-SNE is to find a reduced representation that makes $p_{ij}$ and $q_{ij}$ as close as possible. t-SNE uses the Kullback-Leibler (KL) divergence, which measures the difference between two probability distributions, to construct the objective function
$$J = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}},$$
and applies gradient descent to solve for the low-dimensional representations $z_i$ corresponding to the input data, treating the $z_i$ themselves as the variables being optimized. The gradient for each point is
$$\frac{\partial J}{\partial z_i} = 4 \sum_j (p_{ij} - q_{ij})(z_i - z_j)\left(1 + \|z_i - z_j\|^2\right)^{-1},$$
and $z_i$ is updated iteratively along this gradient. In the actual update a momentum term $\alpha(t)\left(z_i^{(t-1)} - z_i^{(t-2)}\right)$ is added to accelerate the optimization. The approximate algorithm process is as follows:
- Input: N D-dimensional vectors $x_1, \dots, x_N$, mapped to two or three dimensions, with a fixed perplexity Perp, iteration count T, learning rate η, and momentum coefficient α(t)
- Output: Reduced data representations $z_1, \dots, z_N$
- Goal: Visualization in two or three dimensions (the focus is on visualization)
- Hypothesis: In high-dimensional space, the neighbors of a point $x_i$ follow a Gaussian distribution centered at $x_i$; in low-dimensional space, the Euclidean distances between points follow a t-distribution with one degree of freedom

1. Use binary search to determine the $\sigma_i$ corresponding to each $x_i$
2. Calculate the pairwise $p_{j|i}$ and obtain $p_{ij} = (p_{j|i} + p_{i|j}) / (2N)$
3. Initialize $z_1, \dots, z_N$
4. Calculate $q_{ij}$
5. Calculate the gradient $\partial J / \partial z_i$
6. Update $z_i$ with a gradient step of size η plus the momentum term $\alpha(t)(z_i^{(t-1)} - z_i^{(t-2)})$
7. Repeat steps 4-6 until convergence or until the iteration count T is reached
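Since the full gradient loop is lengthy, in practice one typically calls an existing implementation; for example, a short scikit-learn sketch (the data and parameter values here are arbitrary):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.randn(500, 50)                                  # N=500 samples in D=50 dimensions
Z = TSNE(n_components=2, perplexity=30.0).fit_transform(X)    # (500, 2) embedding for plotting
```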
It is important to note that this algorithm treats the low-dimensional data itself as the variables being iterated. Therefore, new data cannot be mapped directly: it must be appended to the original data and the entire computation redone. Hence the main use of t-SNE remains visualization.
3.7 Deep Autoencoder Networks
Autoencoders are a type of neural network and an unsupervised algorithm, capable of both dimensionality reduction and automatically learning features from the data. The principle of this network is that it takes a set of values as input and tries to produce an output that is as close as possible to the input. The network consists of fully connected layers, where each node in a layer connects to all nodes in the previous layer. The structure of the autoencoder is shown in Figure 4 below: the encoder network performs normal forward propagation, $z = Wx + b$, while the decoder network propagates with the transposed parameters of the symmetric encoder layers, giving $x' = W^T z + b'$. When propagation reaches the output layer, whose size equals the input dimension, it produces a set of values $x'$, and the network hopes that these two are equal, $x' = x$. The difference between this output and the true input x is measured with cross-entropy or mean squared error, yielding the reconstruction error cost function, which is minimized through gradient descent so that the network learns the correct parameters. Thus, this network first projects the high-dimensional data into a low-dimensional space through the "encoder" network, and then maps the low-dimensional data back to the high-dimensional space through the "decoder" network.


Figure 5 Inter-layer Structure of Autoencoder
In 2006, Hinton published an article in Science on using deep autoencoder networks for dimensionality reduction [8], mainly proposing a method for effectively initializing the weights, to address the fact that the quality of autoencoder dimensionality reduction depends on the initialization of the network weights. The expression above does not include non-linear transformations; in the actual network, each layer performs a matrix multiplication with the weights followed by a non-linear activation. Additionally, the autoencoder model can incorporate a sparsity property [9], meaning that for N D-dimensional inputs, the summed output of a given node in a given layer over all inputs approaches 0, that is, $\sum_{i=1}^{N} a^{(l)}(x_i) \approx 0$, where l indicates which layer it is and i indicates which input it is. Regularization terms can also be added to impose constraints on the weights.
- Input: N D-dimensional vectors $x_1, \dots, x_N$, a network structure specifying the number of nodes in each layer, iteration count T, learning rate η
- Output: Reduced data representations $z_1, \dots, z_N$
- Goal: To learn the internal properties or structure of the data in order to reconstruct the input data
- Hypothesis: The neural network is exceptionally powerful and can learn features effectively

1. Set the number of layers and the number of nodes in each layer
2. Initialize the weight parameters
3. Perform forward propagation through the encoder to calculate the values of the next layer, $z = Wx + b$
4. Propagate through the decoder to reconstruct the input, $x' = W^T z + b'$
5. Calculate the reconstruction error and, via backpropagation, the gradients of each layer with respect to its inputs and its parameters W, propagating the error through the whole network
6. Sum the gradients for W and update the weights accordingly
7. Repeat steps 3-6 until convergence or until the iteration count T is reached
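As an illustration, here is a minimal PyTorch sketch of such an autoencoder (the layer sizes, optimizer, and loss are arbitrary choices, and the encoder/decoder weights are not tied as in the description above):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, D=784, d=2):
        super().__init__()
        # Encoder: D -> 128 -> d ; decoder mirrors it: d -> 128 -> D
        self.encoder = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, d))
        self.decoder = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, D))

    def forward(self, x):
        z = self.encoder(x)          # low-dimensional code z
        return self.decoder(z), z    # reconstruction x' and code z

model = AutoEncoder(D=784, d=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()               # reconstruction error; cross-entropy is another option

X = torch.randn(256, 784)            # stand-in for N D-dimensional inputs
for epoch in range(50):              # repeat until convergence or T iterations
    x_rec, z = model(X)
    loss = loss_fn(x_rec, X)         # ||x' - x||^2
    optimizer.zero_grad()
    loss.backward()                  # backpropagate the reconstruction error
    optimizer.step()                 # update the weights

Z = model.encoder(X).detach()        # reduced representations z_1, ..., z_N
```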