A Comprehensive Guide to Optimization Algorithms in Machine Learning

Introduction

For almost all machine learning algorithms, whether supervised learning, unsupervised learning, or reinforcement learning, the final goal generally boils down to solving an optimization problem. Optimization methods therefore occupy a central position in the derivation and implementation of machine learning algorithms. In this article, the author provides a comprehensive summary of the optimization algorithms used in machine learning and clarifies the relationships among them, helping you understand this topic from a global perspective.
Mathematical Models Required by Machine Learning
Almost all machine learning algorithms ultimately reduce to finding the extreme value of an objective function, that is, an optimization problem. For example, in supervised learning, we want to find the best mapping function f(x) that minimizes the loss function for the training samples (minimizing empirical risk or structural risk):
\min_w \frac{1}{N} \sum_{i=1}^{N} L(w, x_i, y_i)
Here, N is the number of training samples, L is the loss function for a single sample, w is the vector of model parameters (the parameters of the mapping function), x_i is the feature vector of a sample, and y_i is its label value.
Or find an optimal probability density function p(x) that maximizes the log-likelihood function for the training samples (maximum likelihood estimation):
\max_\theta \sum_{i=1}^{N} \ln p(x_i; \theta)
Here, θ is the model parameters to be solved, which are the parameters of the probability density function.
For unsupervised learning, taking clustering algorithms as an example, the algorithm aims to minimize the sum of the distances of each class sample from the class center:
\min \sum_{i=1}^{k} \sum_{x \in S_i} \| x - \mu_i \|^2
Here, k is the number of classes, x is a sample vector, \mu_i is the center vector of the i-th class, and S_i is the set of samples of the i-th class.
For reinforcement learning, we want to find an optimal policy, that is, the mapping function from state s to action a (deterministic policy; for non-deterministic policies, it is the probability of executing each action):
a = \pi(s)
So that for any given state, after executing the action a determined by this policy function, the cumulative reward is maximized:
\max_\pi V_\pi(s)
Here, V_\pi(s) is the state value function.
Overall, the core of machine learning is to propose a model (generally a mapping function), define an evaluation function (objective function) for that model, and solve for the extreme value of the objective function to determine the model parameters, thereby obtaining the desired model. Of these three key steps, the first two are what machine learning itself studies: establishing the mathematical model. The third is a purely mathematical problem, namely optimization methods, and it is the subject of this article.
Classification of Optimization Algorithms
For the objective functions of various forms and characteristics that arise in machine learning algorithms, suitable solving algorithms have been developed. Apart from a few problems whose optimal solutions can be found by brute-force search, we divide the optimization algorithms used in machine learning into two types (this article does not consider stochastic optimization algorithms such as simulated annealing and genetic algorithms):
    • Analytical Solution

    • Numerical Optimization

The former gives an exact formula for the solution of an optimization problem, also called a closed-form solution, and is generally a theoretical result. The latter approximates the optimal point numerically when an exact formula for the extreme point is very difficult to obtain. In addition, there are other solving ideas, such as divide and conquer and dynamic programming. A good optimization algorithm should:
    • Correctly find the extreme points in various situations

    • Be fast

The following diagram shows the classification of these algorithms and their relationships:
[Figure: classification of the optimization algorithms and their relationships]
Next, we will elaborate according to this diagram.
Fermat’s Theorem
For a differentiable function, the unified approach to finding its extreme value is to find the point where the derivative is 0, that is, Fermat’s theorem. This theorem in calculus states that for a differentiable function, the derivative at the extreme point must be 0:
f'(x) = 0
For multivariable functions, the gradient must be 0:
\nabla f(x) = 0
Points where the derivative is 0 are called stationary points. Note that a derivative of 0 is only a necessary condition for an extreme value, not a sufficient one: a stationary point is merely a candidate extreme point. Whether it actually is an extreme value, and whether it is a maximum or a minimum, depends on higher-order derivatives. For a univariate function, suppose x is a stationary point:
  • If f''(x) > 0, the function has a minimum at that point

  • If f''(x) < 0, the function has a maximum at that point

  • If f''(x) = 0, higher-order derivatives must be examined

For multivariable functions, assume x is a stationary point:
  • If the Hessian matrix is positive definite, the function has a minimum at that point

  • If the Hessian matrix is negative definite, the function has a maximum at that point

  • If the Hessian matrix is indefinite, the point is not an extreme point but a saddle point; if the Hessian is singular (only semidefinite), higher-order information is needed

At a point where the derivative is 0, the function may not take an extreme value at all; such a point is called a saddle point. A classic example is f(x, y) = x^2 - y^2 at the origin. The following diagram shows a saddle point (from SIGAI Cloud Laboratory):
[Figure: surface of a function near a saddle point]
Besides saddle points, optimization algorithms may also run into the local extremum problem: a stationary point is an extreme point, but not the global extremum. Imposing suitable restrictions on the optimization problem can effectively rule out both issues. The typical case is convex optimization, which requires the feasible region of the optimization variables to be a convex set and the objective function to be a convex function.
Although a stationary point is only a necessary condition for an extreme value, not a sufficient one, once the stationary points are found it is much easier to test and filter them for extrema. Both theoretical derivations and numerical optimization algorithms therefore generally take finding stationary points as the goal of finding extreme points. For a univariate function, we differentiate and solve the equation f'(x) = 0 to obtain all stationary points. For a multivariable function, we take the partial derivative with respect to each variable, set them all to 0, and solve the resulting system of equations to obtain all stationary points. These are the basic methods taught in calculus. Fortunately, many objective functions in machine learning are differentiable, so this approach applies.
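A small illustration of this recipe (a sketch; the function f(x) = x^3 - 3x and the use of the SymPy library are my own choices, not from the article):

```python
# Hypothetical example: find and classify the stationary points of
# f(x) = x^3 - 3x symbolically with SymPy.
import sympy as sp

x = sp.symbols('x', real=True)
f = x**3 - 3*x

stationary = sp.solve(sp.diff(f, x), x)    # points where f'(x) = 0
for p in stationary:
    second = sp.diff(f, x, 2).subs(x, p)   # sign of f'' decides the type
    kind = 'minimum' if second > 0 else 'maximum' if second < 0 else 'inconclusive'
    print(p, kind)                         # -1 maximum, 1 minimum
```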
Lagrange Multiplier Method
Fermat’s theorem provides the necessary condition for the extreme value of a function without constraints. For some practical application problems, there are generally equality or inequality constraints. The classic solution for extreme value problems with equality constraints is the Lagrange multiplier method.
For the following problem:
\min_x f(x) \quad \text{s.t.} \quad h_i(x) = 0, \; i = 1, \dots, p
Construct the Lagrange multiplier function:
L(x, \lambda) = f(x) + \sum_{i=1}^{p} \lambda_i h_i(x)
At the optimal point, the partial derivatives with respect to x and to the multiplier variables \lambda_i must all be 0:
\nabla_x f(x) + \sum_{i=1}^{p} \lambda_i \nabla_x h_i(x) = 0, \quad h_i(x) = 0
Solving this system of equations yields the candidate optimal solutions. For a more detailed explanation of the Lagrange multiplier method, consult any advanced calculus textbook. The places where the Lagrange multiplier method is used in machine learning include the following (a small worked example is given after this list):
  • Principal Component Analysis

  • Linear Discriminant Analysis

  • Laplacian Eigenmaps in Manifold Learning

  • Hidden Markov Model
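A minimal worked example (my own, not from the article): maximize f(x, y) = xy subject to x + y = 1. The Lagrangian is

L(x, y, \lambda) = xy + \lambda (x + y - 1)

Setting the partial derivatives to zero gives y + \lambda = 0, x + \lambda = 0, and x + y = 1, hence x = y = 1/2, \lambda = -1/2, and the maximum value is f = 1/4.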

KKT Conditions
The KKT conditions are a generalization of the Lagrange multiplier method, used to solve optimization problems with both equality and inequality constraints. For the following optimization problem:
\min_x f(x) \quad \text{s.t.} \quad g_k(x) \le 0, \; k = 1, \dots, q; \quad h_j(x) = 0, \; j = 1, \dots, p
Analogous to the Lagrangian, the KKT conditions construct the following multiplier function:
L(x, \lambda, \mu) = f(x) + \sum_{j=1}^{p} \lambda_j h_j(x) + \sum_{k=1}^{q} \mu_k g_k(x)
λ and μ are called KKT multipliers. At the optimal solution x*, the following conditions should be satisfied:
\nabla_x L(x^*) = 0, \quad \mu_k \ge 0, \quad \mu_k g_k(x^*) = 0, \quad h_j(x^*) = 0, \quad g_k(x^*) \le 0
The equality constraints h_j(x^*) = 0 and the inequality constraints g_k(x^*) \le 0 are simply the constraints that any feasible point must satisfy, and \nabla_x L(x^*) = 0 is the same condition as in the Lagrange multiplier method. The only genuinely new condition concerns g_k(x):
\mu_k \ge 0, \quad \mu_k g_k(x^*) = 0
The KKT conditions are only necessary conditions for an extreme value, not sufficient ones. A small worked example follows the list below. The places where the KKT conditions are used in machine learning include:
  • Support Vector Machine (SVM)
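A minimal worked example (my own, not from the article): minimize f(x) = x^2 subject to g(x) = 1 - x \le 0. The multiplier function is L(x, \mu) = x^2 + \mu (1 - x), and the KKT conditions read

2x - \mu = 0, \quad \mu \ge 0, \quad \mu (1 - x) = 0, \quad 1 - x \le 0

Taking \mu = 0 would force x = 0, which is infeasible, so the constraint must be active: x^* = 1 and \mu^* = 2.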

Numerical Optimization Algorithms
The three methods described above can be used in theoretical derivations or in cases where a closed-form solution exists (for example, linear functions, or maximum likelihood estimation for the normal distribution). For the vast majority of functions, however, the equation obtained by setting the gradient to 0 cannot be solved directly, for example when it contains transcendental functions such as exponentials or logarithms. For such equations we can only resort to approximate methods, that is, numerical optimization algorithms. These algorithms generally use derivative information about the objective function: an algorithm that uses only first-order derivatives is called a first-order optimization algorithm, and one that also uses second-order derivatives is called a second-order optimization algorithm.
In engineering practice, iterative methods are usually adopted: starting from an initial point x_0, some rule is applied repeatedly to move from x_k to the next point x_{k+1}, constructing a sequence that converges to a point where the gradient is 0, that is, such that the following limit holds:
\lim_{k \to +\infty} \nabla f(x_k) = 0
These rules generally use first-order derivative information (the gradient) or second-order derivative information (the Hessian matrix). The core of an iterative method is thus the update formula that determines the next point from the current one:
x_{k+1} = h(x_k)
Gradient Descent Method
The gradient descent method searches in the opposite direction of the gradient, utilizing the first-order derivative information of the function. The iterative formula for the gradient descent method is:
x_{k+1} = x_k - \gamma \nabla f(x_k)
According to the first-order Taylor expansion of the function, the function value decreases in the negative gradient direction. As long as the learning rate \gamma is set small enough and the gradient has not yet reached 0, the function value decreases at each iteration. The learning rate is set to a small positive number so that x_{k+1} stays within a neighborhood of x_k, which allows the higher-order terms of the Taylor expansion to be ignored and guarantees that the function value decreases during iteration.
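A minimal sketch of this iteration in NumPy (the quadratic objective f(x) = x^T x, the initial point, and all constants are illustrative choices, not from the article):

```python
# Gradient descent sketch: repeatedly step against the gradient.
import numpy as np

def grad(x):
    return 2 * x               # gradient of f(x) = x^T x

x = np.array([3.0, -2.0])      # initial point x_0
lr = 0.1                       # learning rate: a small positive number
for _ in range(100):
    x = x - lr * grad(x)       # x_{k+1} = x_k - lr * grad f(x_k)
print(x)                       # close to the minimum at the origin
```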
The gradient descent method and its variants are widely used in machine learning, especially in deep learning.
Momentum Term
To accelerate the convergence speed of the gradient descent method and reduce oscillation, a momentum term is introduced. The momentum term accumulates the gradient values from previous iterations, and the parameter update formula after adding this term is:
x_{t+1} = x_t + V_{t+1}
Where Vt+1 is the momentum term, which replaces the previous gradient term. The calculation formula for the momentum term is:
V_{t+1} = \mu V_t - \alpha \nabla f(x_t)
It is a weighted combination of the previous momentum term and the current gradient, where \alpha is the learning rate and \mu is the momentum coefficient. Expanding over time, the t-th iteration uses all gradient values from the 1st to the t-th iteration, with older gradients decaying exponentially in powers of \mu:
V_{t+1} = -\alpha \left( \nabla f(x_t) + \mu \nabla f(x_{t-1}) + \mu^2 \nabla f(x_{t-2}) + \cdots \right)
The momentum term accumulates the gradient values from previous iterations, allowing this iteration to move forward in the direction of previous inertia.
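A sketch of the same loop with a momentum term (same illustrative objective; the coefficients are my own choices):

```python
# Gradient descent with momentum: V accumulates past gradients so each
# step keeps some of the previous direction ("inertia").
import numpy as np

def grad(x):
    return 2 * x                      # gradient of f(x) = x^T x

x = np.array([3.0, -2.0])
v = np.zeros_like(x)                  # momentum term, initialized to 0
alpha, mu = 0.1, 0.9                  # learning rate and momentum coefficient
for _ in range(100):
    v = mu * v - alpha * grad(x)      # V_{t+1} = mu * V_t - alpha * grad
    x = x + v                         # x_{t+1} = x_t + V_{t+1}
print(x)
```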
AdaGrad Algorithm
The AdaGrad algorithm is the most direct improvement of the gradient descent method. The gradient descent method relies on a manually set learning rate; if it is set too small, convergence is too slow, and if it is set too large, it may cause the algorithm not to converge, making it very difficult to set a suitable value for this learning rate.
The AdaGrad algorithm dynamically adjusts the learning rate based on the historical gradient values from previous iterations, and each component xi of the optimization variable vector X has its own learning rate. The parameter update formula is:
(x_{t+1})_i = (x_t)_i - \alpha \frac{(g_t)_i}{\sqrt{\sum_{j=1}^{t} \left( (g_j)_i \right)^2 + \varepsilon}}
Where \alpha is the learning factor, g_t is the gradient vector of the parameters at the t-th iteration, \varepsilon is a very small positive number that avoids division by zero, and the subscript i denotes the component of the vector. The only difference from standard gradient descent is the extra denominator, which accumulates the historical squared gradients up to the current iteration to scale the descent step. By this formula, the larger the absolute values of a component's historical derivatives, the smaller that component's learning rate, and vice versa. Although the learning rate is now adaptive, the algorithm still has problems: a global learning rate \alpha must be set manually, and as time accumulates, the denominator grows ever larger, driving the learning rate toward 0 and eventually preventing effective parameter updates.
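A sketch of the AdaGrad update under the same illustrative setup; note how every component of x gets its own effective step size:

```python
# AdaGrad sketch: per-component learning rates from accumulated
# squared gradients.
import numpy as np

def grad(x):
    return 2 * x

x = np.array([3.0, -2.0])
hist = np.zeros_like(x)                   # running sum of squared gradients
alpha, eps = 0.5, 1e-8
for _ in range(200):
    g = grad(x)
    hist += g**2                          # accumulate (g_t)_i^2 over history
    x -= alpha * g / np.sqrt(hist + eps)  # component-wise adaptive step
print(x)
```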
RMSProp Algorithm
The RMSProp algorithm is an improvement over AdaGrad that avoids the learning rate approaching 0 through long-term accumulation of gradient values. The specific approach is to maintain a vector RMS built from the gradient values, initialized to 0, which accumulates the historical squared gradients with a decay coefficient. The update formula is:
RMS_t = \delta \, RMS_{t-1} + (1 - \delta) g_t^2
AdaGrad directly sums all historical squared gradients, whereas here each accumulation step first decays the history by the coefficient \delta. The parameter update formula is:
x_{t+1} = x_t - \frac{\alpha}{\sqrt{RMS_t + \varepsilon}} g_t
Where \delta is a manually set decay parameter, \varepsilon is a small positive number that avoids division by zero, and, as in AdaGrad, a global learning rate \alpha must still be specified manually.
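A sketch of the RMSProp update (same illustrative setup); the only change from AdaGrad is the decayed accumulator:

```python
# RMSProp sketch: a decaying average of squared gradients replaces the
# ever-growing sum used by AdaGrad.
import numpy as np

def grad(x):
    return 2 * x

x = np.array([3.0, -2.0])
ms = np.zeros_like(x)                     # decayed mean of squared gradients
alpha, delta, eps = 0.01, 0.9, 1e-8
for _ in range(500):
    g = grad(x)
    ms = delta * ms + (1 - delta) * g**2  # decayed accumulation
    x -= alpha * g / np.sqrt(ms + eps)    # adaptive per-component step
print(x)
```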
AdaDelta Algorithm
The AdaDelta algorithm is also an improvement over AdaGrad, avoiding the problem of the learning rate approaching 0 due to the long-term accumulation of gradient values, and additionally, it removes the dependency on manually setting a global learning rate. Assume the parameter to be optimized is x, and the gradient value calculated during the t-th iteration of the gradient descent method is gt. The algorithm first initializes the following two vectors to zero vectors:
E[g^2]_0 = 0, \quad E[\Delta x^2]_0 = 0
Where E[g^2] is the running accumulation of squared gradients (computed separately for each component), with update formula:
E[g^2]_t = \delta \, E[g^2]_{t-1} + (1 - \delta) g_t^2
Here, g2 is the square of each element of the vector, and all subsequent calculation formulas are performed for each component of the vector. Next, the following RMS quantity is calculated:
RMS[g]_t = \sqrt{E[g^2]_t + \varepsilon}
This is also a vector, calculated separately for each component of the vector. Then, the update value of the parameters is calculated:
\Delta x_t = -\frac{RMS[\Delta x]_{t-1}}{RMS[g]_t} \, g_t
RMS[\Delta x]_{t-1} is computed analogously from the accumulation E[\Delta x^2] of squared updates. This update value is likewise built from the gradient, except that its learning rate is determined by the history of the gradient values. The iterative formula for the parameter update is:
x_{t+1} = x_t + \Delta x_t
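A sketch of the AdaDelta update (same illustrative setup); note that no global learning rate appears anywhere:

```python
# AdaDelta sketch: the step is the ratio of two running RMS values,
# so no global learning rate is needed.
import numpy as np

def grad(x):
    return 2 * x

x = np.array([3.0, -2.0])
eg = np.zeros_like(x)     # E[g^2]: running average of squared gradients
ex = np.zeros_like(x)     # E[dx^2]: running average of squared updates
delta, eps = 0.9, 1e-6
for _ in range(500):
    g = grad(x)
    eg = delta * eg + (1 - delta) * g**2
    dx = -np.sqrt(ex + eps) / np.sqrt(eg + eps) * g  # RMS[dx]_{t-1}/RMS[g]_t * g
    ex = delta * ex + (1 - delta) * dx**2
    x = x + dx
print(x)
```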
Adam Algorithm
The Adam algorithm integrates adaptive learning rates with momentum terms. The algorithm constructs two vectors m and v from the gradient, the former being the momentum term, and the latter accumulating the sum of squared gradients, used to construct adaptive learning rates. Their initial values are set to 0, and the update formulas are:
(m_t)_i = \beta_1 (m_{t-1})_i + (1 - \beta_1)(g_t)_i
(v_t)_i = \beta_2 (v_{t-1})_i + (1 - \beta_2)\left( (g_t)_i \right)^2
Where β1, β2 are manually specified parameters, and i is the index of the vector component. Relying on these two values, the parameter update value is constructed, and the parameter update formula is:
(x_{t+1})_i = (x_t)_i - \alpha \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \cdot \frac{(m_t)_i}{\sqrt{(v_t)_i} + \varepsilon}
Here, m is similar to the momentum term, and v is used to construct the learning rate.
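A sketch of the Adam update (same illustrative setup). The bias-corrected form m_hat / (sqrt(v_hat) + eps) used below is algebraically the same as the coefficient written in the formula above:

```python
# Adam sketch: m is the momentum-like first moment, v the second moment
# used to build the adaptive learning rate.
import numpy as np

def grad(x):
    return 2 * x

x = np.array([3.0, -2.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
alpha, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 301):
    g = grad(x)
    m = b1 * m + (1 - b1) * g            # first moment (momentum term)
    v = b2 * v + (1 - b2) * g**2         # second moment (squared gradients)
    m_hat = m / (1 - b1**t)              # bias corrections
    v_hat = v / (1 - b2**t)
    x -= alpha * m_hat / (np.sqrt(v_hat) + eps)
print(x)
```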
Stochastic Gradient Descent Method
Assuming there are N samples in the training sample set, the goal of the supervised learning algorithm during training is to optimize the average loss function over this dataset:
\min_w \frac{1}{N} \sum_{i=1}^{N} L(w, x_i, y_i) + \lambda r(w)
Where L(w,xi,yi) is the loss function for a single training sample (xi,yi), w is the parameter to be learned, r(w) is the regularization term, and λ is the weight of the regularization term. When the number of training samples is very large, if all samples are used in each iteration during training, the computational cost is too high. As an improvement, we can select a batch of samples in each iteration and define the loss function based on these samples.
The mini-batch stochastic gradient descent method uses a random approximation of the above objective function at each iteration: only M < N randomly selected samples are used to approximate the loss function. The objective optimized at each iteration becomes:
\frac{1}{M} \sum_{i=1}^{M} L(w, x_i, y_i) + \lambda r(w)
The stochastic gradient descent method converges in a probabilistic sense.
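A mini-batch SGD sketch on a synthetic least-squares problem (the data, the model, and all constants are my own illustration, not from the article):

```python
# Mini-batch SGD: each iteration estimates the gradient of the average
# loss from M randomly chosen samples instead of all N.
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 32                                 # dataset size, batch size
X = rng.normal(size=(N, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=N)

w = np.zeros(3)
lr = 0.1
for _ in range(2000):
    idx = rng.choice(N, size=M, replace=False)  # random mini-batch
    Xb, yb = X[idx], y[idx]
    g = 2 * Xb.T @ (Xb @ w - yb) / M            # gradient of the batch loss
    w -= lr * g
print(w)                                        # approaches w_true
```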
Newton’s Method
Newton’s method is a second-order optimization technique that utilizes both first-order and second-order derivative information to directly find points where the gradient is 0. The iterative formula for Newton’s method is:
x_{k+1} = x_k - \gamma H_k^{-1} g_k
Where H is the Hessian matrix and g is the gradient vector. Newton's method does not guarantee that the function value decreases at each iteration, nor that it converges to a minimum point. As with gradient descent, a learning rate \gamma must therefore be set in implementation so that the higher-order terms of the Taylor expansion can be ignored; it is usually determined by line search.
In implementation, we generally do not directly compute the inverse of the Hessian matrix, but instead solve the following linear equation:
H_k d = -g_k
The solution d is called the Newton direction. The iteration terminates when the gradient is sufficiently close to 0 or when the specified maximum number of iterations is reached.
Newton’s method has a faster convergence speed than the gradient descent method, but requires computing the Hessian matrix at each iteration and solving a linear equation, which is computationally intensive. Additionally, if the Hessian matrix is not invertible, this method fails.
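A sketch of the Newton iteration (the objective is an illustrative choice; a unit step is used here instead of a line search):

```python
# Newton's method sketch: solve H d = -g for the Newton direction
# rather than inverting the Hessian.
import numpy as np

def grad(x):                                   # gradient of f = x0^4 + x1^2
    return np.array([4 * x[0]**3, 2 * x[1]])

def hess(x):                                   # Hessian of the same f
    return np.array([[12 * x[0]**2, 0.0],
                     [0.0, 2.0]])

x = np.array([2.0, 1.0])
for _ in range(50):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:              # stop when gradient is near 0
        break
    d = np.linalg.solve(hess(x), -g)           # Newton direction
    x = x + d                                  # unit step (no line search)
print(x)                                       # close to the minimum at (0, 0)
```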
Newton’s method has practical applications in machine learning algorithms such as logistic regression and AdaBoost.
Quasi-Newton Methods
Newton's method requires computing the Hessian matrix at each iteration and solving a linear system with that matrix as the coefficient matrix, and the Hessian may not be invertible. To address this, improved methods known as quasi-Newton methods have been proposed; the Broyden family (of which BFGS is the best-known member) is a typical representative. The idea is not to compute the Hessian of the objective function and invert it, but to construct, from gradient information, a positive definite symmetric matrix that approximates the Hessian or its inverse, and to use that matrix in the Newton-style iteration.
All these main numerical optimization algorithms can be freely experimented with on the SIGAI Cloud Laboratory:
www.sigai.cn
By constructing the objective function and specifying the parameters and initial values of the optimization algorithm, the running process of the algorithm can be visualized, and the solving results under different parameters can be compared.
Trust Region Newton’s Method
The standard Newton's method may not converge to an optimal solution and cannot guarantee that the function value decreases along the iteration sequence. This can be remedied by adjusting the step taken along the Newton direction; the two commonly used techniques are line search and trust region methods. The trust region Newton method is a variant of the truncated Newton method, used to solve optimization problems with bound constraints. At each iteration, it maintains the current iterate x_k, a trust region size \Delta_k, and a quadratic model of the objective:
q_k(s) = \nabla f(x_k)^T s + \frac{1}{2} s^T \nabla^2 f(x_k) s
This expression can be obtained through Taylor expansion, ignoring higher-order terms:
f(x_k + s) - f(x_k) \approx q_k(s)
The algorithm seeks an s_k that approximately minimizes q_k(s) subject to the constraint \|s\| \le \Delta_k. It then checks the following ratio in order to update x_k and \Delta_k:
\rho_k = \frac{f(x_k + s_k) - f(x_k)}{q_k(s_k)}
This is the ratio of the actual decrease in function value to the decrease in function value predicted by the quadratic approximation model. Based on the previous calculation results, the size of the trust region is dynamically adjusted.
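A sketch of just this acceptance and radius-update bookkeeping (the 0.25/0.75 thresholds and the halving/doubling factors are common conventions, not prescribed by the article):

```python
# Trust region bookkeeping sketch. x and s are NumPy arrays; f evaluates
# the objective and q evaluates the quadratic model q_k of the change.
def update_trust_region(x, s, f, q, radius):
    rho = (f(x + s) - f(x)) / q(s)  # actual decrease / predicted decrease
    if rho > 0:                     # the step really decreased f: accept it
        x = x + s
    if rho < 0.25:                  # model fits poorly: shrink the region
        radius *= 0.5
    elif rho > 0.75:                # model fits well: allow a larger region
        radius *= 2.0
    return x, radius
```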
The trust region Newton method has practical applications in logistic regression and linear support vector machine training; for details, see the liblinear open-source library.
Divide and Conquer Method
The divide and conquer method is an algorithm design idea that decomposes a large problem into smaller subproblems for solving. The solution to the entire problem is constructed based on the solutions to the subproblems. In optimization methods, the specific approach is to adjust only part of the components of the optimization vector during each iteration while keeping the other components fixed.
Coordinate Descent Method
The basic idea of the coordinate descent method is to optimize one variable at a time; it is a form of divide and conquer. Suppose the optimization problem to be solved is:
\min_x f(x), \quad x = (x_1, x_2, \dots, x_n)
Coordinate descent selects one component x_i to optimize while keeping the other components fixed, converting the extreme value problem of a multivariable function into a sequence of univariate extreme value problems. When the problem is large in scale, this approach can speed up the solution considerably.
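A sketch of coordinate descent on an illustrative convex quadratic, where each one-dimensional minimization has a closed form:

```python
# Coordinate descent sketch: sweep over coordinates, minimizing f exactly
# in one variable at a time (problem data is illustrative).
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # f(x) = 0.5 x^T A x - b^T x, A positive definite
b = np.array([1.0, 1.0])

x = np.zeros(2)
for _ in range(50):
    for i in range(len(x)):
        # Setting df/dx_i = A[i] @ x - b[i] to 0 and solving for x[i]:
        x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
print(x, np.linalg.solve(A, b))     # both approach the same solution
```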
The coordinate descent method has practical applications in logistic regression and linear support vector machine training; for details, see the liblinear open-source library.
SMO Algorithm
The SMO algorithm is also a divide and conquer method, used to solve the dual problem of support vector machines. The dual problem with slack variables and kernel functions is:
\max_\alpha \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \; \sum_{i=1}^{N} \alpha_i y_i = 0
The core idea of the SMO algorithm is to optimize two components αi and αj from the optimization variables at a time while keeping the other components fixed, ensuring that the equality constraint conditions are satisfied. The reason for choosing to optimize two variables instead of one is that there are equality constraints; adjusting only one variable would violate the equality constraint.
Assume the two selected components are \alpha_i and \alpha_j, with the other components fixed as constants. The objective restricted to these two variables is a bivariate quadratic function subject to the equality and inequality constraints. Because the equality constraint lets one variable be expressed through the other, the subproblem reduces to minimizing a univariate quadratic function over an interval, whose exact solution can be written down directly.
Stage-wise Optimization
Stage-wise optimization fixes one part a of the optimization variables and optimizes the other part b at each iteration, then fixes b and optimizes a, alternating in this way until convergence to the optimal solution.
The AdaBoost algorithm is a typical representative of this method. The AdaBoost algorithm uses an exponential loss function during training:
L(y, F(x)) = e^{-y F(x)}
Since the strong classifier is a weighted sum of multiple weak classifiers, substituting the above loss function gives the objective function to be optimized during algorithm training:
(\beta_j, f_j) = \arg\min_{\beta, f} \sum_{i=1}^{N} \exp\left( -y_i \left( F_{j-1}(x_i) + \beta f(x_i) \right) \right)
Here, the exponential loss is split into two parts: the existing strong classifier F_{j-1}, and the loss of the current weak classifier f on the training samples. The former was already determined in previous iterations and can therefore be treated as a constant. The objective function thus simplifies to:
(\beta_j, f_j) = \arg\min_{\beta, f} \sum_{i=1}^{N} w_i^{j-1} \exp\left( -\beta y_i f(x_i) \right)
Where:
w_i^{j-1} = \exp\left( -y_i F_{j-1}(x_i) \right)
This problem is solved in two steps: first treat the weak classifier weight \beta as a constant and find the optimal weak classifier f; once f is obtained, optimize the weight coefficient \beta.
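For the standard discrete AdaBoost setting (weak classifiers with \pm 1 outputs), this two-step procedure has a well-known closed-form answer for the second step, which the article does not spell out: if err_j denotes the weighted error rate of the chosen weak classifier f_j under the weights w_i^{j-1}, the optimal weight is

\beta_j = \frac{1}{2} \ln \frac{1 - \text{err}_j}{\text{err}_j}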
Dynamic Programming Algorithm
Dynamic programming is also a solving idea that decomposes a problem into subproblems. Its key property is that if a solution to the entire problem is optimal, then every part of that solution is also optimal for the corresponding subproblem. Solving the subproblems optimally and extending the solutions step by step yields the optimal solution of the entire problem.
Typical representatives of this type of method are the decoding algorithm of hidden Markov models (the Viterbi algorithm) and the dynamic programming algorithms of reinforcement learning. Such algorithms generally optimize over discrete variables and are combinatorial optimization problems; the derivative-based optimization algorithms discussed earlier cannot be used on them. Dynamic programming solves such problems efficiently, resting on the Bellman optimality principle: once the problem is written as a recursive optimality equation, an algorithm can be constructed to solve it.
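A minimal Viterbi sketch for a two-state HMM (the toy model parameters are my own illustration, not from the article):

```python
# Viterbi sketch: dp[s] holds the best log-probability of any state path
# ending in state s after the observations seen so far.
import numpy as np

start = np.log([0.6, 0.4])                # initial state distribution
trans = np.log([[0.7, 0.3],               # transition probabilities
                [0.4, 0.6]])
emit = np.log([[0.9, 0.1],                # emission probabilities
               [0.2, 0.8]])
obs = [0, 1, 1]                           # observed symbol sequence

dp = start + emit[:, obs[0]]
back = []
for o in obs[1:]:
    scores = dp[:, None] + trans          # scores[i, j]: end in i, move to j
    back.append(scores.argmax(axis=0))    # best predecessor of each state
    dp = scores.max(axis=0) + emit[:, o]

path = [int(dp.argmax())]                 # backtrack from the best final state
for bp in reversed(back):
    path.append(int(bp[path[-1]]))
path.reverse()
print(path)                               # most probable state sequence
```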