
Source: Algorithm Advancement
It provides an information table of seven algorithms, summarizing the parameters and main objectives of each algorithm, and introduces some basic concepts of dimensionality reduction.
- 3.1 Principal Component Analysis (PCA)
- 3.2 Multidimensional Scaling (MDS)
- 3.3 Linear Discriminant Analysis (LDA)
- 3.4 Isometric Mapping (Isomap)
- 3.5 Locally Linear Embedding (LLE)
- 3.6 t-SNE
- 3.7 Deep Autoencoder Networks

3.1 Principal Component Analysis (PCA)
PCA was invented by Karl Pearson in 1901 and is a linear dimensionality reduction method. A point in high-dimensional space (with dimension D) is mapped to a point in low-dimensional space (with dimension d, d < D) by multiplying with a matrix W, i.e. $z_i = W^T x_i$, where W has dimensions D×d and i indexes the i-th sample point. Thus we obtain N points mapped from D-dimensional space to d-dimensional space. The goal of PCA is to separate the mapped points as much as possible, meaning that the variance of the N points $z_1, \dots, z_N$ should be maximized. If the mean of each dimension in the D-dimensional space is 0, that is, $\sum_{i=1}^{N} x_i = 0$, then multiplying both sides by $W^T$ shows that the reduced data also has mean 0. Consider the matrix $C = \frac{1}{N} X X^T$ with $X = (x_1, \dots, x_N)$: this is the covariance matrix of the D-dimensional data. Its diagonal entries are the variances of each of the D dimensions, while the off-diagonal entries are the covariances between pairs of dimensions.
Now consider the covariance matrix B of the reduced d-dimensional data. If we want the points after reduction to be as separated as possible, the diagonal entries of B, which are the variances of each dimension, should be as large as possible; a larger variance indicates that the data in that dimension has good discriminability. We also want the d dimensions to be orthogonal: orthogonality ensures that two dimensions are uncorrelated and carry no overlapping information, so that the reduced representation describes the data best, with every dimension having sufficient discriminability and distinct information. In that case all off-diagonal entries of B should be zero. It can also be derived that
$$B = \frac{1}{N} Z Z^T = \frac{1}{N} W^T X X^T W = W^T C W.$$
This equation indicates that the role of the linear transformation matrix W in the PCA algorithm is to diagonalize the original covariance matrix C. Since diagonalization in linear algebra is achieved by solving for eigenvalues and the corresponding eigenvectors, we can derive the PCA algorithm below (the procedure is mainly taken from Professor Zhou Zhihua's book "Machine Learning", with the Goal and Hypothesis added for comparison with the subsequent algorithms; the derivation in Professor Zhou's book uses the Lagrange multiplier method, which is essentially the same as [3]. I highly recommend the blog post [3] on the mathematical principles of PCA).
- Input: N D-dimensional vectors $x_1, \dots, x_N$, reduced to d dimensions
- Output: Projection matrix $W = (w_1, \dots, w_d)$, where each $w_i$ is a D-dimensional column vector
- Goal: To separate the data as much as possible after projection, i.e. maximize $\operatorname{tr}(W^T C W)$ (the trace appears because, as noted above, the off-diagonal elements of B are all 0, while the diagonal elements are precisely the variances of each dimension)
- Hypothesis: The variance of each dimension of the reduced data is as large as possible, and the dimensions are mutually orthogonal

1. Center each dimension of the input to 0
2. Calculate the covariance matrix $C = \frac{1}{N} X X^T$ of the input
3. Perform eigenvalue decomposition on the covariance matrix C
4. Take the top d eigenvalues and their corresponding eigenvectors to form $W = (w_1, \dots, w_d)$
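To make the four steps above concrete, here is a minimal NumPy sketch of this plain version of PCA (the helper name `pca` is mine, and the code uses the rows-as-samples convention, X of shape N×D):

```python
import numpy as np

def pca(X, d):
    """Minimal PCA sketch. X has shape (N, D); returns (W, Z) with Z of shape (N, d)."""
    # 1. Center each dimension to zero mean
    Xc = X - X.mean(axis=0)
    # 2. Covariance matrix C (D x D)
    C = Xc.T @ Xc / Xc.shape[0]
    # 3. Eigenvalue decomposition (C is symmetric, so eigh is appropriate)
    eigvals, eigvecs = np.linalg.eigh(C)
    # 4. Keep the eigenvectors of the d largest eigenvalues as the projection matrix W
    order = np.argsort(eigvals)[::-1][:d]
    W = eigvecs[:, order]              # shape (D, d)
    return W, Xc @ W                   # projected data z_i = W^T x_i

# Example: reduce 100 random 5-D points to 2-D
W, Z = pca(np.random.randn(100, 5), d=2)
```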
In addition, PCA has many variants such as kernel PCA, probabilistic PCA, etc. This article only considers the simplest version of PCA for now.
3.2 Multidimensional Scaling (MDS)
The goal of MDS is to preserve the dissimilarity of the data during dimensionality reduction, which can also be understood as keeping the distance relationships of high-dimensional space unchanged in low-dimensional space. The distances are represented by a matrix A, where entry $a_{ij}$ is the distance between samples i and j among the N samples, and it is assumed that the distances in low-dimensional space are Euclidean distances. The reduced data is represented as $Z = (z_1, \dots, z_N) \in \mathbb{R}^{d \times N}$, thus we have
$$a_{ij}^2 = \|z_i - z_j\|^2 = \|z_i\|^2 + \|z_j\|^2 - 2 z_i^T z_j.$$
The three terms on the right are uniformly represented by the inner product matrix $E = Z^T Z$, with $e_{ij} = z_i^T z_j$. After centering, each row and column sum of E is 0, which leads to
$$e_{ij} = -\frac{1}{2}\left(a_{ij}^2 - a_{i\cdot}^2 - a_{\cdot j}^2 + a_{\cdot\cdot}^2\right),$$
or, in matrix form, $E = -\frac{1}{2} H A^{(2)} H$, where $A^{(2)}$ is the matrix of squared distances and the centering matrix H is the identity matrix I minus the all-ones matrix divided by N. The subscripts $i\cdot$ and $\cdot j$ denote averaging over a row or column of the squared distance matrix, i.e. $a_{i\cdot}^2 = \frac{1}{N}\sum_j a_{ij}^2$, $a_{\cdot j}^2 = \frac{1}{N}\sum_i a_{ij}^2$, $a_{\cdot\cdot}^2 = \frac{1}{N^2}\sum_{i,j} a_{ij}^2$. This establishes the relationship between the distance matrix A and the inner product matrix E: knowing A allows us to solve for E, and by performing eigenvalue decomposition on E we can represent the data under all eigenvalues as $Z = \Lambda^{1/2} V^T$. When only the d largest eigenvalues are selected, the distance matrix in d-dimensional space approximates the one in the high-dimensional space D. The MDS process is as follows [2]:
- Input: Distance matrix $A \in \mathbb{R}^{N \times N}$ (the superscript indicates the matrix size); the original data is D-dimensional and is reduced to d dimensions
- Output: Reduced matrix $Z \in \mathbb{R}^{d \times N}$
- Goal: To maintain the relative relationships between the data during dimensionality reduction
- Hypothesis: The distance matrix of the N samples is known

1. Calculate $a_{i\cdot}^2$, $a_{\cdot j}^2$, $a_{\cdot\cdot}^2$
2. Calculate the inner product matrix E
3. Perform eigenvalue decomposition on E
4. Take the d largest eigenvalues to form $\Lambda$, with the corresponding eigenvectors arranged in order to form V; the output is $Z = \Lambda^{1/2} V^T$
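Under these assumptions, classical MDS can be sketched in a few lines of NumPy (the function name `classical_mds` is mine; rows of the returned Z are samples):

```python
import numpy as np

def classical_mds(A, d):
    """Classical MDS sketch. A is an (N, N) distance matrix; returns Z of shape (N, d)."""
    N = A.shape[0]
    A2 = A ** 2                                   # squared distances
    H = np.eye(N) - np.ones((N, N)) / N           # centering matrix: I minus the all-ones matrix / N
    E = -0.5 * H @ A2 @ H                         # inner product (Gram) matrix
    eigvals, eigvecs = np.linalg.eigh(E)
    order = np.argsort(eigvals)[::-1][:d]         # d largest eigenvalues
    L = np.maximum(eigvals[order], 0)             # guard against tiny negative eigenvalues
    return eigvecs[:, order] * np.sqrt(L)         # Z = V * Lambda^{1/2}

# Example: recover 2-D coordinates from pairwise Euclidean distances
X = np.random.randn(50, 2)
A = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Z = classical_mds(A, d=2)
```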
3.3 Linear Discriminant Analysis (LDA)
LDA was initially proposed by Fisher in 1936 to solve binary classification problems. Since its calculation effectively reduces the dimensionality of the data, it can also be used for supervised linear dimensionality reduction. It projects high-dimensional data onto a low-dimensional space in which the class of each sample is determined; here we consider the case of K classes. Its goal is to classify the samples as accurately as possible into the K classes, which is reflected in projection points of the same class being close together and projection points of different classes being far apart. This differs from PCA: PCA aims to separate all samples along each retained dimension, whereas LDA allows the low-dimensional projections of samples to overlap, which PCA does not. The dimensionality reduction itself is similar to PCA, a linear map by matrix multiplication, with projection points $z_i = W^T x_i$.
Consider first the two-class case with some projection direction. The projected class centers correspond to the original high-dimensional means $\mu_1$ and $\mu_2$. Since we want samples of different classes to be as far apart as possible, we want the projected centers to be as distant as possible, i.e. the objective is to maximize $\|W^T \mu_1 - W^T \mu_2\|_2^2$. But having the centers far apart is not enough: projecting perpendicular to the $x_1$ axis, for example, may place the centers far apart while some samples still overlap in the projection space. An additional optimization objective is therefore needed: the projection points of the same class should be as close as possible, i.e. the within-class scatter of same-class samples should be minimized. For class 1, the scatter of the projected samples is
$$\sum_{x \in X_1} \left(W^T x - W^T \mu_1\right)\left(W^T x - W^T \mu_1\right)^T = W^T \Sigma_1 W,$$
where $X_1$ represents the set of samples belonging to class 1, $\mu_1$ is its mean, W is the projection matrix, and the middle matrix $\Sigma_1 = \sum_{x \in X_1} (x - \mu_1)(x - \mu_1)^T$ is the covariance matrix of class 1. Summing over the classes gives the within-class scatter matrix of the original data for the K classes, $S_w = \sum_{k=1}^{K} \Sigma_k$. For two classes, the distance between the projected centers is obtained by subtracting the centers directly; for K classes we first compute the overall sample center $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$ and measure how far each class center lies from it, giving the between-class scatter matrix $S_b = \sum_{k=1}^{K} N_k (\mu_k - \mu)(\mu_k - \mu)^T$ (where $N_k$ indicates the number of samples belonging to class k). To summarize, the optimization objective of the LDA algorithm is to maximize the following cost function:
$$J(W) = \frac{\operatorname{tr}\left(W^T S_b W\right)}{\operatorname{tr}\left(W^T S_w W\right)}.$$
In the binary case W has dimensions D×1, so J(W) is a scalar; for K classes W has dimensions D×(K-1), and the objective is formulated with matrix traces. Note that LDA does not center the data: if each class were centered to its own mean, all class centers would coincide, which is why this algorithm performs no centering. Taking the derivative of J(W) with respect to W and setting it to zero shows that W is composed of the eigenvectors of $S_w^{-1} S_b$ corresponding to its K-1 largest eigenvalues. Thus, we can use W to reduce X.

- Input: N D-dimensional vectors $x_1, \dots, x_N$, divided into K classes
- Output: Projection matrix $W = (w_1, \dots, w_{K-1})$, where each $w_i$ is a D-dimensional column vector
- Goal: Minimize the covariance between samples of the same class after projection, while maximizing the distance between the centers of different classes
- Hypothesis: The optimization goal is to maximize $J(W) = \operatorname{tr}(W^T S_b W) / \operatorname{tr}(W^T S_w W)$

1. Calculate the within-class scatter matrix $S_w$ and the between-class scatter matrix $S_b$
2. Perform singular value decomposition on $S_w$ to obtain $S_w^{-1}$
3. Perform eigenvalue decomposition on the matrix $S_w^{-1} S_b$
4. Take the top K-1 eigenvalues and their corresponding eigenvectors $w_1, \dots, w_{K-1}$
I personally believe that the optimization objective here reflects a hypothesis similar to PCA's, namely that the matrices appearing in the numerator and denominator of the objective are diagonal, i.e. that the transformation W makes both $W^T S_b W$ and $W^T S_w W$ diagonal matrices.
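As a sanity check of the procedure, here is a minimal NumPy sketch of multi-class LDA under the formulation above (the function name `lda` is mine; in practice one would typically use sklearn's LinearDiscriminantAnalysis instead):

```python
import numpy as np

def lda(X, y, n_components=None):
    """Minimal LDA sketch. X: (N, D) samples, y: (N,) integer class labels.
    Returns the projection matrix W of shape (D, n_components)."""
    classes = np.unique(y)
    K, D = len(classes), X.shape[1]
    mu = X.mean(axis=0)                              # overall sample center
    Sw = np.zeros((D, D))
    Sb = np.zeros((D, D))
    for k in classes:
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        Sw += (Xk - mk).T @ (Xk - mk)                # within-class scatter
        diff = (mk - mu).reshape(-1, 1)
        Sb += len(Xk) * diff @ diff.T                # between-class scatter
    # Eigen-decompose Sw^{-1} Sb (pinv used for numerical robustness)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    d = n_components or (K - 1)                      # at most K-1 useful directions
    return eigvecs[:, order[:d]].real

# Example: project 3-class data onto K-1 = 2 discriminant directions
X = np.vstack([np.random.randn(30, 4) + c for c in range(3)])
y = np.repeat([0, 1, 2], 30)
Z = X @ lda(X, y)
```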
3.4 Isometric Mapping (Isomap)
The previously mentioned MDS only performs dimensionality reduction: it requires the distance relationships in high-dimensional space to be known and cannot reflect the potential manifold on which the high-dimensional data lies. However, the basic idea of manifold learning can be combined with MDS for dimensionality reduction [5]: locally, distances in high-dimensional space can be calculated with the Euclidean distance. For the distance matrix A used by MDS, the distance between two neighboring points is $A_{ij} = \|x_i - x_j\|$, i.e. their Euclidean distance; the distance between non-neighboring points is initialized to $A_{ij} = \infty$ and then filled in by a shortest-path algorithm running over the neighborhood graph, which approximates the geodesic distance along the manifold. Building A therefore requires deciding which points are neighbors; Isomap uses KNN to identify the neighboring points. The overall algorithm process is as follows:
- Input: N D-dimensional vectors $x_1, \dots, x_N$, each point has K neighboring points, mapped to d dimensions
- Output: Reduced matrix $Z \in \mathbb{R}^{d \times N}$
- Goal: To preserve the manifold structure of the high-dimensional data during dimensionality reduction
- Hypothesis: Local distances between points in high-dimensional space can be calculated using Euclidean distances

1. Use KNN to construct part of A, i.e. find the neighboring points of each sample and fill in their Euclidean distances $A_{ij}$, initializing the remaining entries to infinity
2. Use a shortest-path algorithm (e.g. Dijkstra's algorithm) to compute the path lengths between the remaining pairs of points and fill in those distances
3. Use the completed distance matrix A as the input to MDS and take its output
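A compact sketch of these three steps chains a KNN graph, a shortest-path solver, and the classical MDS from Section 3.2 (graph utilities from scikit-learn and SciPy; the wiring and the name `isomap` are mine, and a connected neighborhood graph is assumed):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, K=10, d=2):
    """Minimal Isomap sketch. X: (N, D) samples; returns an (N, d) embedding."""
    # 1. KNN graph: edges to the K nearest neighbors carry their Euclidean distances,
    #    all other pairs are left unconnected (effectively "infinite" distance)
    G = kneighbors_graph(X, n_neighbors=K, mode='distance')
    # 2. Shortest paths over the graph approximate geodesic distances on the manifold
    A = shortest_path(G, method='D', directed=False)      # 'D' = Dijkstra
    # 3. Classical MDS on the completed distance matrix (same steps as Section 3.2)
    N = A.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    E = -0.5 * H @ (A ** 2) @ H
    eigvals, eigvecs = np.linalg.eigh(E)
    order = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))

# Example usage on random data
Z = isomap(np.random.randn(300, 5), K=10, d=2)
```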
3.5 Locally Linear Embedding (LLE)
As previously mentioned, local regions of a manifold have Euclidean properties. LLE assumes that the coordinates of a point $x_i$ can be expressed as a linear combination of its surrounding points, that is, $x_i = \sum_{x_j \in X_i} w_{ij} x_j$ (where $X_i$ represents the set of points in the neighborhood of $x_i$); this is the representation in high-dimensional space. Since this relationship is assumed to be preserved in low-dimensional space as well, $z_i = \sum_{x_j \in X_i} w_{ij} z_j$, with the same weights appearing in both equations.
Based on the above assumption, the first step is to solve for these weights. Assuming each sample point is reconstructed from its K surrounding samples, the linear combination weights of one point form a 1×K vector, and the weights of all points form an N×K matrix F. The weights are obtained by minimizing the reconstruction error
$$\min_{F} \sum_{i=1}^{N} \Big\| x_i - \sum_{x_j \in X_i} w_{ij} x_j \Big\|^2 \quad \text{s.t.} \quad \sum_{x_j \in X_i} w_{ij} = 1,$$
which is solved in closed form with respect to F. After obtaining the weights, substitute them into the optimization objective for the low-dimensional space,
$$\min_{Z} \sum_{i=1}^{N} \Big\| z_i - \sum_{x_j \in X_i} w_{ij} z_j \Big\|^2,$$
and solve for Z, where F is arranged as N×K and constraints (such as $Z Z^T = I$) are applied to Z. The Lagrange multiplier method turns this into an eigenvalue problem of the form $M Z^T = \lambda Z^T$, where $M = (I - W)^T (I - W)$ and W is the N×N matrix obtained by placing the rows of F at their neighbor positions; Z is thus obtained through eigenvalue decomposition of M.
- Input: N D-dimensional vectors $x_1, \dots, x_N$, each point has K neighboring points, mapped to d dimensions
- Output: Reduced matrix Z
- Goal: To preserve the manifold structure of the high-dimensional data during dimensionality reduction
- Hypothesis: The coordinates of a point in high-dimensional space are a linear combination of its K neighbors, and the dimensions in low-dimensional space are orthogonal

1. Use KNN to find the K neighboring points of each sample, then calculate the matrices F and M
2. Perform eigenvalue decomposition on M
3. Take the d smallest non-zero eigenvalues and their corresponding eigenvectors to form Z (since the objective is a minimization, the smallest eigenvalues are taken)
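For concreteness, here is a minimal NumPy sketch of these three steps (the regularization term and the function name `lle` are my additions for numerical stability):

```python
import numpy as np

def lle(X, K=10, d=2, reg=1e-3):
    """Minimal LLE sketch. X: (N, D) samples; returns an (N, d) embedding."""
    N = X.shape[0]
    # 1. K nearest neighbours by Euclidean distance (excluding the point itself)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = np.argsort(dist, axis=1)[:, 1:K + 1]

    # 2. Reconstruction weights: solve a small regularized local Gram system per point
    W = np.zeros((N, N))
    for i in range(N):
        Ni = neighbors[i]
        G = (X[Ni] - X[i]) @ (X[Ni] - X[i]).T        # local Gram matrix (K, K)
        G += reg * np.trace(G) * np.eye(K)           # regularization for stability
        w = np.linalg.solve(G, np.ones(K))
        W[i, Ni] = w / w.sum()                       # weights of each point sum to 1

    # 3. Embedding: smallest eigenvectors of M = (I - W)^T (I - W)
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:d + 1]                          # skip the trivial constant eigenvector

# Example usage on random data
Z = lle(np.random.randn(200, 3), K=10, d=2)
```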
3.6 t-SNE
t-SNE is also a method for reducing high-dimensional data to two or three dimensions. It was proposed by Maaten in 2008 [6] and builds on the Stochastic Neighbor Embedding (SNE) method proposed by Hinton in 2002. The main idea is to model, for any two points in high-dimensional space, the neighbors $x_j$ of $x_i$ with a Gaussian distribution centered at $x_i$ with standard deviation $\sigma_i$, and similarly the neighbors $x_i$ of $x_j$ with a Gaussian centered at $x_j$ with standard deviation $\sigma_j$. The conditional probability that $x_i$ regards $x_j$ as similar is
$$p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)},$$
i.e. the proportion of the probability of $x_j$ under the Gaussian centered at $x_i$ relative to the total probability of all other samples under that Gaussian, which reflects the similarity between the two points from the perspective of $x_i$. Then let $p_{ij} = (p_{j|i} + p_{i|j}) / (2N)$ be the joint probability describing the similarity between the two points among all samples. The paper does not state explicitly whether σ is a scalar or a vector, but since in the actual computation $p_{ij}$ is not obtained directly from a joint-probability formula but from the conditional probabilities above, a separate $\sigma_i$ is computed for each sample i. It is fixed by the user-specified perplexity
$$\mathrm{Perp}(P_i) = 2^{H(P_i)}, \quad H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i},$$
and binary search is used to find the $\sigma_i$ for each $x_i$ such that substituting it into the two equations above yields the chosen value of Perp. Therefore σ is effectively a vector; it is unlikely that all samples share a single Gaussian distribution parameter.
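The per-point binary search for $\sigma_i$ and the symmetrization into $p_{ij}$ can be sketched roughly as follows (a NumPy illustration; the helper name `conditional_probs` and the search bounds are my choices):

```python
import numpy as np

def conditional_probs(X, perplexity=30.0, tol=1e-5, max_iter=50):
    """Binary-search sigma_i per point so that the perplexity of P_{.|i}
    matches the target, then symmetrize to the joint p_ij."""
    N = X.shape[0]
    D2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)   # pairwise squared distances
    P = np.zeros((N, N))
    target_entropy = np.log2(perplexity)                     # Perp = 2^{H(P_i)}
    for i in range(N):
        lo, hi = 1e-20, 1e20                                 # search over beta = 1/(2*sigma_i^2)
        beta = 1.0
        for _ in range(max_iter):
            p = np.exp(-D2[i] * beta)
            p[i] = 0.0
            p /= p.sum()
            entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))  # H(P_i) in bits
            if abs(entropy - target_entropy) < tol:
                break
            if entropy > target_entropy:                     # too flat -> increase beta
                lo = beta
                beta = beta * 2 if hi == 1e20 else (beta + hi) / 2
            else:                                            # too peaked -> decrease beta
                hi = beta
                beta = (beta + lo) / 2
        P[i] = p
    return (P + P.T) / (2 * N)                               # joint probabilities p_ij

# Example usage on random data
P = conditional_probs(np.random.randn(100, 10), perplexity=20.0)
```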
At the same time, the mutual relationship, or similarity, between two points in low-dimensional space is also represented by a joint probability. It is assumed that the Euclidean distances between points in low-dimensional space follow a Student's t-distribution with one degree of freedom, so the similarity of $z_i$ and $z_j$ is the probability of their distance relative to all pairwise distances:
$$q_{ij} = \frac{\left(1 + \|z_i - z_j\|^2\right)^{-1}}{\sum_{k \neq l}\left(1 + \|z_k - z_l\|^2\right)^{-1}}.$$
If the similarity $p_{ij}$ computed in high-dimensional space for $x_i, x_j$ equals the similarity $q_{ij}$ computed in low-dimensional space for $z_i, z_j$, the points in low-dimensional space accurately reflect the relative positional relationships in high-dimensional space. Therefore, the goal of t-SNE is to find a reduced representation that makes $p_{ij}$ and $q_{ij}$ as close as possible. t-SNE uses the Kullback-Leibler (KL) divergence, which measures the difference between two probability distributions, to construct the objective function
$$J = KL(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}},$$
and applies gradient descent to solve for the low-dimensional representations $z_i$ corresponding to the input data, treating the $z_i$ themselves as the variables being optimized. The gradient for each point is
$$\frac{\partial J}{\partial z_i} = 4 \sum_j (p_{ij} - q_{ij})(z_i - z_j)\left(1 + \|z_i - z_j\|^2\right)^{-1},$$
and $z_i$ is updated iteratively along this gradient. In the actual update a momentum term $\alpha(t)\left(z_i^{(t-1)} - z_i^{(t-2)}\right)$ is added to accelerate the optimization. The approximate algorithm process is as follows:
- Input: N D-dimensional vectors $x_1, \dots, x_N$, mapped to two or three dimensions, with a fixed perplexity Perp, iteration count T, learning rate η, and momentum coefficient α(t)
- Output: Reduced data representations $z_1, \dots, z_N$
- Goal: Visualization in two or three dimensions (the focus is on visualization)
- Hypothesis: In high-dimensional space, the neighbors of a point $x_i$ follow a Gaussian distribution centered at $x_i$; in low-dimensional space, the Euclidean distances between points follow a t-distribution with one degree of freedom

1. Use binary search to determine the $\sigma_i$ corresponding to each $x_i$
2. Calculate the pairwise $p_{j|i}$ and obtain $p_{ij} = (p_{j|i} + p_{i|j}) / (2N)$
3. Initialize $z_1, \dots, z_N$
4. Calculate $q_{ij}$
5. Calculate the gradient $\partial J / \partial z_i$
6. Update $z_i$ with a gradient step of size η plus the momentum term $\alpha(t)(z_i^{(t-1)} - z_i^{(t-2)})$
7. Repeat steps 4-6 until convergence or until the iteration count T is reached
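Since the full gradient loop is lengthy, in practice one typically calls an existing implementation; for example, a short scikit-learn sketch (the data and parameter values here are arbitrary):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.randn(500, 50)                                  # N=500 samples in D=50 dimensions
Z = TSNE(n_components=2, perplexity=30.0).fit_transform(X)    # (500, 2) embedding for plotting
```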
It is important to note that this algorithm treats the low-dimensional data itself as the variables being iterated. Therefore, new data cannot be mapped directly: it must be appended to the original data and the entire computation redone. Hence the main use of t-SNE remains visualization.
3.7 Deep Autoencoder Networks
Autoencoders are a type of neural network and an unsupervised algorithm, capable of both dimensionality reduction and automatically learning features from the data. The principle of this network is that it takes a set of values as input and tries to produce an output that is as close as possible to the input. The network consists of fully connected layers, where each node in a layer connects to all nodes in the previous layer. The structure of the autoencoder is shown in Figure 4 below: the encoder network performs normal forward propagation, $z = Wx + b$, while the decoder network propagates with the transposed parameters of the symmetric encoder layers, giving $x' = W^T z + b'$. When propagation reaches the output layer, whose size equals the input dimension, it produces a set of values $x'$, and the network hopes that these two are equal, $x' = x$. The difference between this output and the true input x is measured with cross-entropy or mean squared error, yielding the reconstruction error cost function, which is minimized through gradient descent so that the network learns the correct parameters. Thus, this network first projects the high-dimensional data into a low-dimensional space through the "encoder" network, and then maps the low-dimensional data back to the high-dimensional space through the "decoder" network.


Figure 5 Inter-layer Structure of Autoencoder
In 2006, Hinton published an article in Science on using deep autoencoder networks for dimensionality reduction [8], mainly proposing a method for effectively initializing the weights, to address the fact that the quality of autoencoder dimensionality reduction depends on the initialization of the network weights. The expression above does not include non-linear transformations; in the actual network, each layer performs a matrix multiplication with the weights followed by a non-linear activation. Additionally, the autoencoder model can incorporate a sparsity property [9], meaning that for N D-dimensional inputs, the summed output of a given node in a given layer over all inputs approaches 0, that is, $\sum_{i=1}^{N} a^{(l)}(x_i) \approx 0$, where l indicates which layer it is and i indicates which input it is. Regularization terms can also be added to impose constraints on the weights.
- Input: N D-dimensional vectors $x_1, \dots, x_N$, a network structure specifying the number of nodes in each layer, iteration count T, learning rate η
- Output: Reduced data representations $z_1, \dots, z_N$
- Goal: To learn the internal properties or structure of the data in order to reconstruct the input data
- Hypothesis: The neural network is exceptionally powerful and can learn features effectively

1. Set the number of layers and the number of nodes in each layer
2. Initialize the weight parameters
3. Perform forward propagation through the encoder to calculate the values of the next layer, $z = Wx + b$
4. Propagate through the decoder to reconstruct the input, $x' = W^T z + b'$
5. Calculate the reconstruction error and, via backpropagation, the gradients of each layer with respect to its inputs and its parameters W, propagating the error through the whole network
6. Sum the gradients for W and update the weights accordingly
7. Repeat steps 3-6 until convergence or until the iteration count T is reached
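As an illustration, here is a minimal PyTorch sketch of such an autoencoder (the layer sizes, optimizer, and loss are arbitrary choices, and the encoder/decoder weights are not tied as in the description above):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, D=784, d=2):
        super().__init__()
        # Encoder: D -> 128 -> d ; decoder mirrors it: d -> 128 -> D
        self.encoder = nn.Sequential(nn.Linear(D, 128), nn.ReLU(), nn.Linear(128, d))
        self.decoder = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, D))

    def forward(self, x):
        z = self.encoder(x)          # low-dimensional code z
        return self.decoder(z), z    # reconstruction x' and code z

model = AutoEncoder(D=784, d=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()               # reconstruction error; cross-entropy is another option

X = torch.randn(256, 784)            # stand-in for N D-dimensional inputs
for epoch in range(50):              # repeat until convergence or T iterations
    x_rec, z = model(X)
    loss = loss_fn(x_rec, X)         # ||x' - x||^2
    optimizer.zero_grad()
    loss.backward()                  # backpropagate the reconstruction error
    optimizer.step()                 # update the weights

Z = model.encoder(X).detach()        # reduced representations z_1, ..., z_N
```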