A Comprehensive Guide to Optimization Algorithms in Machine Learning

Introduction

For almost all machine learning algorithms, whether supervised learning, unsupervised learning, or reinforcement learning, the final goal generally boils down to solving an optimization problem. Optimization methods therefore occupy a central position in the derivation and implementation of machine learning algorithms. In this article, the author provides a comprehensive summary of the optimization algorithms used in machine learning and clarifies the relationships among them, helping you understand this topic from a global perspective.
Mathematical Models Required by Machine Learning
Almost all machine learning algorithms ultimately reduce to finding the extreme value of an objective function, that is, an optimization problem. For example, in supervised learning, we want to find the best mapping function f(x) that minimizes the loss function for the training samples (minimizing empirical risk or structural risk):
\min_w \frac{1}{N} \sum_{i=1}^{N} L(w, x_i, y_i)
Here, N is the number of training samples, L is the loss function for a single sample, w is the vector of model parameters (the parameters of the mapping function), x_i is the feature vector of a sample, and y_i is its label value.
Or find an optimal probability density function p(x) that maximizes the log-likelihood function for the training samples (maximum likelihood estimation):
\max_\theta \sum_{i=1}^{N} \ln p(x_i; \theta)
Here, θ is the model parameters to be solved, which are the parameters of the probability density function.
For unsupervised learning, taking clustering algorithms as an example, the algorithm aims to minimize the sum of the distances of each class sample from the class center:
\min \sum_{i=1}^{k} \sum_{x \in S_i} \| x - \mu_i \|^2
Here, k is the number of classes, x is a sample vector, \mu_i is the center vector of the i-th class, and S_i is the set of samples of the i-th class.
For reinforcement learning, we want to find an optimal policy, that is, the mapping function from state s to action a (deterministic policy; for non-deterministic policies, it is the probability of executing each action):
a = \pi(s)
So that for any given state, after executing the action a determined by this policy function, the cumulative reward is maximized:
\max_\pi V_\pi(s)
Here, V_\pi(s) is the state value function.
Overall, the core of machine learning is to propose a model (generally a mapping function), define an evaluation function (objective function) for that model, and solve for the extreme value of the objective function to determine the model parameters, thereby obtaining the desired model. Of these three key steps, the first two are what machine learning itself studies: establishing the mathematical model. The third is a purely mathematical problem, namely optimization methods, and it is the subject of this article.
Classification of Optimization Algorithms
For the objective functions of various forms and characteristics that arise in machine learning algorithms, suitable solving algorithms have been developed. Apart from a few problems whose optimal solutions can be found by brute-force search, we divide the optimization algorithms used in machine learning into two types (this article does not consider stochastic optimization algorithms such as simulated annealing and genetic algorithms):
    • Analytical Solution

    • Numerical Optimization

The former gives an exact formula for the solution of an optimization problem, also called a closed-form solution, and is generally a theoretical result. The latter approximates the optimal point numerically when an exact formula for the extreme point is very difficult to obtain. In addition, there are other solving ideas, such as divide and conquer and dynamic programming. A good optimization algorithm should:
    • Correctly find the extreme points in various situations

    • Be fast

The following diagram shows the classification of these algorithms and their relationships:
[Figure: classification of the optimization algorithms and their relationships]
Next, we will elaborate according to this diagram.
Fermat’s Theorem
For a differentiable function, the unified approach to finding its extreme value is to find the point where the derivative is 0, that is, Fermat’s theorem. This theorem in calculus states that for a differentiable function, the derivative at the extreme point must be 0:
f'(x) = 0
For multivariable functions, the gradient must be 0:
\nabla f(x) = 0
Points where the derivative is 0 are called stationary points. Note that a derivative of 0 is only a necessary condition for an extreme value, not a sufficient one: a stationary point is merely a candidate extreme point. Whether it actually is an extreme value, and whether it is a maximum or a minimum, depends on higher-order derivatives. For a univariate function, suppose x is a stationary point:
  • If f''(x) > 0, the function has a minimum at that point

  • If f''(x) < 0, the function has a maximum at that point

  • If f''(x) = 0, higher-order derivatives must be examined

For multivariable functions, assume x is a stationary point:
  • If the Hessian matrix is positive definite, the function has a minimum at that point

  • If the Hessian matrix is negative definite, the function has a maximum at that point

  • If the Hessian matrix is indefinite, the point is not an extreme point but a saddle point; if the Hessian is singular (only semidefinite), higher-order information is needed

At a point where the derivative is 0, the function may not take an extreme value at all; such a point is called a saddle point. A classic example is f(x, y) = x^2 - y^2 at the origin. The following diagram shows a saddle point (from SIGAI Cloud Laboratory):
[Figure: surface of a function near a saddle point]
Besides saddle points, optimization algorithms may also run into the local extremum problem: a stationary point is an extreme point, but not the global extremum. Imposing suitable restrictions on the optimization problem can effectively rule out both issues. The typical case is convex optimization, which requires the feasible region of the optimization variables to be a convex set and the objective function to be a convex function.
Although a stationary point is only a necessary condition for an extreme value, not a sufficient one, once the stationary points are found it is much easier to test and filter them for extrema. Both theoretical derivations and numerical optimization algorithms therefore generally take finding stationary points as the goal of finding extreme points. For a univariate function, we differentiate and solve the equation f'(x) = 0 to obtain all stationary points. For a multivariable function, we take the partial derivative with respect to each variable, set them all to 0, and solve the resulting system of equations to obtain all stationary points. These are the basic methods taught in calculus. Fortunately, many objective functions in machine learning are differentiable, so this approach applies.
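A small illustration of this recipe (a sketch; the function f(x) = x^3 - 3x and the use of the SymPy library are my own choices, not from the article):

```python
# Hypothetical example: find and classify the stationary points of
# f(x) = x^3 - 3x symbolically with SymPy.
import sympy as sp

x = sp.symbols('x', real=True)
f = x**3 - 3*x

stationary = sp.solve(sp.diff(f, x), x)    # points where f'(x) = 0
for p in stationary:
    second = sp.diff(f, x, 2).subs(x, p)   # sign of f'' decides the type
    kind = 'minimum' if second > 0 else 'maximum' if second < 0 else 'inconclusive'
    print(p, kind)                         # -1 maximum, 1 minimum
```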
Lagrange Multiplier Method
Fermat’s theorem provides the necessary condition for the extreme value of a function without constraints. For some practical application problems, there are generally equality or inequality constraints. The classic solution for extreme value problems with equality constraints is the Lagrange multiplier method.
For the following problem:
\min_x f(x) \quad \text{s.t.} \quad h_i(x) = 0, \; i = 1, \dots, p
Construct the Lagrange multiplier function:
L(x, \lambda) = f(x) + \sum_{i=1}^{p} \lambda_i h_i(x)
At the optimal point, the partial derivatives with respect to x and to the multiplier variables \lambda_i must all be 0:
\nabla_x f(x) + \sum_{i=1}^{p} \lambda_i \nabla_x h_i(x) = 0, \quad h_i(x) = 0
Solving this system of equations yields the candidate optimal solutions. For a more detailed explanation of the Lagrange multiplier method, consult any advanced calculus textbook. The places where the Lagrange multiplier method is used in machine learning include the following (a small worked example is given after this list):
  • Principal Component Analysis

  • Linear Discriminant Analysis

  • Laplacian Eigenmaps in Manifold Learning

  • Hidden Markov Model
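A minimal worked example (my own, not from the article): maximize f(x, y) = xy subject to x + y = 1. The Lagrangian is

L(x, y, \lambda) = xy + \lambda (x + y - 1)

Setting the partial derivatives to zero gives y + \lambda = 0, x + \lambda = 0, and x + y = 1, hence x = y = 1/2, \lambda = -1/2, and the maximum value is f = 1/4.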

KKT Conditions
The KKT conditions are a generalization of the Lagrange multiplier method, used to solve optimization problems with both equality and inequality constraints. For the following optimization problem:
\min_x f(x) \quad \text{s.t.} \quad g_k(x) \le 0, \; k = 1, \dots, q; \quad h_j(x) = 0, \; j = 1, \dots, p
Analogous to the Lagrangian, the KKT conditions construct the following multiplier function:
L(x, \lambda, \mu) = f(x) + \sum_{j=1}^{p} \lambda_j h_j(x) + \sum_{k=1}^{q} \mu_k g_k(x)
λ and μ are called KKT multipliers. At the optimal solution x*, the following conditions should be satisfied:
\nabla_x L(x^*) = 0, \quad \mu_k \ge 0, \quad \mu_k g_k(x^*) = 0, \quad h_j(x^*) = 0, \quad g_k(x^*) \le 0
The equality constraints h_j(x^*) = 0 and the inequality constraints g_k(x^*) \le 0 are simply the constraints that any feasible point must satisfy, and \nabla_x L(x^*) = 0 is the same condition as in the Lagrange multiplier method. The only genuinely new condition concerns g_k(x):
\mu_k \ge 0, \quad \mu_k g_k(x^*) = 0
The KKT conditions are only necessary conditions for an extreme value, not sufficient ones. A small worked example follows the list below. The places where the KKT conditions are used in machine learning include:
  • Support Vector Machine (SVM)
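A minimal worked example (my own, not from the article): minimize f(x) = x^2 subject to g(x) = 1 - x \le 0. The multiplier function is L(x, \mu) = x^2 + \mu (1 - x), and the KKT conditions read

2x - \mu = 0, \quad \mu \ge 0, \quad \mu (1 - x) = 0, \quad 1 - x \le 0

Taking \mu = 0 would force x = 0, which is infeasible, so the constraint must be active: x^* = 1 and \mu^* = 2.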

Numerical Optimization Algorithms
The three methods described above can be used in theoretical derivations or in cases where a closed-form solution exists (for example, linear functions, or maximum likelihood estimation for the normal distribution). For the vast majority of functions, however, the equation obtained by setting the gradient to 0 cannot be solved directly, for example when it contains transcendental functions such as exponentials or logarithms. For such equations we can only resort to approximate methods, that is, numerical optimization algorithms. These algorithms generally use derivative information about the objective function: an algorithm that uses only first-order derivatives is called a first-order optimization algorithm, and one that also uses second-order derivatives is called a second-order optimization algorithm.
In engineering practice, iterative methods are usually adopted: starting from an initial point x_0, some rule is applied repeatedly to move from x_k to the next point x_{k+1}, constructing a sequence that converges to a point where the gradient is 0, that is, such that the following limit holds:
\lim_{k \to +\infty} \nabla f(x_k) = 0
These rules generally use first-order derivative information (the gradient) or second-order derivative information (the Hessian matrix). The core of an iterative method is thus the update formula that determines the next point from the current one:
x_{k+1} = h(x_k)
Gradient Descent Method
The gradient descent method searches in the opposite direction of the gradient, utilizing the first-order derivative information of the function. The iterative formula for the gradient descent method is:
x_{k+1} = x_k - \gamma \nabla f(x_k)
According to the first-order Taylor expansion of the function, the function value decreases in the negative gradient direction. As long as the learning rate \gamma is set small enough and the gradient has not yet reached 0, the function value decreases at each iteration. The learning rate is set to a small positive number so that x_{k+1} stays within a neighborhood of x_k, which allows the higher-order terms of the Taylor expansion to be ignored and guarantees that the function value decreases during iteration.
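A minimal sketch of this iteration in NumPy (the quadratic objective f(x) = x^T x, the initial point, and all constants are illustrative choices, not from the article):

```python
# Gradient descent sketch: repeatedly step against the gradient.
import numpy as np

def grad(x):
    return 2 * x               # gradient of f(x) = x^T x

x = np.array([3.0, -2.0])      # initial point x_0
lr = 0.1                       # learning rate: a small positive number
for _ in range(100):
    x = x - lr * grad(x)       # x_{k+1} = x_k - lr * grad f(x_k)
print(x)                       # close to the minimum at the origin
```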
The gradient descent method and its variants are widely used in machine learning, especially in deep learning.
Momentum Term
To accelerate the convergence speed of the gradient descent method and reduce oscillation, a momentum term is introduced. The momentum term accumulates the gradient values from previous iterations, and the parameter update formula after adding this term is:
x_{t+1} = x_t + V_{t+1}
Where Vt+1 is the momentum term, which replaces the previous gradient term. The calculation formula for the momentum term is:
V_{t+1} = \mu V_t - \alpha \nabla f(x_t)
It is a weighted combination of the previous momentum term and the current gradient, where \alpha is the learning rate and \mu is the momentum coefficient. Expanding over time, the t-th iteration uses all gradient values from the 1st to the t-th iteration, with older gradients decaying exponentially in powers of \mu:
V_{t+1} = -\alpha \left( \nabla f(x_t) + \mu \nabla f(x_{t-1}) + \mu^2 \nabla f(x_{t-2}) + \cdots \right)
The momentum term accumulates the gradient values from previous iterations, allowing this iteration to move forward in the direction of previous inertia.
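A sketch of the same loop with a momentum term (same illustrative objective; the coefficients are my own choices):

```python
# Gradient descent with momentum: V accumulates past gradients so each
# step keeps some of the previous direction ("inertia").
import numpy as np

def grad(x):
    return 2 * x                      # gradient of f(x) = x^T x

x = np.array([3.0, -2.0])
v = np.zeros_like(x)                  # momentum term, initialized to 0
alpha, mu = 0.1, 0.9                  # learning rate and momentum coefficient
for _ in range(100):
    v = mu * v - alpha * grad(x)      # V_{t+1} = mu * V_t - alpha * grad
    x = x + v                         # x_{t+1} = x_t + V_{t+1}
print(x)
```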
AdaGrad Algorithm
The AdaGrad algorithm is the most direct improvement of the gradient descent method. The gradient descent method relies on a manually set learning rate; if it is set too small, convergence is too slow, and if it is set too large, it may cause the algorithm not to converge, making it very difficult to set a suitable value for this learning rate.
The AdaGrad algorithm dynamically adjusts the learning rate based on the historical gradient values from previous iterations, and each component xi of the optimization variable vector X has its own learning rate. The parameter update formula is:
(x_{t+1})_i = (x_t)_i - \alpha \frac{(g_t)_i}{\sqrt{\sum_{j=1}^{t} \left( (g_j)_i \right)^2 + \varepsilon}}
Where \alpha is the learning factor, g_t is the gradient vector of the parameters at the t-th iteration, \varepsilon is a very small positive number that avoids division by zero, and the subscript i denotes the component of the vector. The only difference from standard gradient descent is the extra denominator, which accumulates the historical squared gradients up to the current iteration to scale the descent step. By this formula, the larger the absolute values of a component's historical derivatives, the smaller that component's learning rate, and vice versa. Although the learning rate is now adaptive, the algorithm still has problems: a global learning rate \alpha must be set manually, and as time accumulates, the denominator grows ever larger, driving the learning rate toward 0 and eventually preventing effective parameter updates.
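A sketch of the AdaGrad update under the same illustrative setup; note how every component of x gets its own effective step size:

```python
# AdaGrad sketch: per-component learning rates from accumulated
# squared gradients.
import numpy as np

def grad(x):
    return 2 * x

x = np.array([3.0, -2.0])
hist = np.zeros_like(x)                   # running sum of squared gradients
alpha, eps = 0.5, 1e-8
for _ in range(200):
    g = grad(x)
    hist += g**2                          # accumulate (g_t)_i^2 over history
    x -= alpha * g / np.sqrt(hist + eps)  # component-wise adaptive step
print(x)
```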
RMSProp Algorithm
The RMSProp algorithm is an improvement over AdaGrad that avoids the learning rate approaching 0 through long-term accumulation of gradient values. The specific approach is to maintain a vector RMS built from the gradient values, initialized to 0, which accumulates the historical squared gradients with a decay coefficient. The update formula is:
RMS_t = \delta \, RMS_{t-1} + (1 - \delta) g_t^2
AdaGrad directly sums all historical squared gradients, whereas here each accumulation step first decays the history by the coefficient \delta. The parameter update formula is:
x_{t+1} = x_t - \frac{\alpha}{\sqrt{RMS_t + \varepsilon}} g_t
Where \delta is a manually set decay parameter, \varepsilon is a small positive number that avoids division by zero, and, as in AdaGrad, a global learning rate \alpha must still be specified manually.
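A sketch of the RMSProp update (same illustrative setup); the only change from AdaGrad is the decayed accumulator:

```python
# RMSProp sketch: a decaying average of squared gradients replaces the
# ever-growing sum used by AdaGrad.
import numpy as np

def grad(x):
    return 2 * x

x = np.array([3.0, -2.0])
ms = np.zeros_like(x)                     # decayed mean of squared gradients
alpha, delta, eps = 0.01, 0.9, 1e-8
for _ in range(500):
    g = grad(x)
    ms = delta * ms + (1 - delta) * g**2  # decayed accumulation
    x -= alpha * g / np.sqrt(ms + eps)    # adaptive per-component step
print(x)
```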
AdaDelta Algorithm
The AdaDelta algorithm is also an improvement over AdaGrad, avoiding the problem of the learning rate approaching 0 due to the long-term accumulation of gradient values, and additionally, it removes the dependency on manually setting a global learning rate. Assume the parameter to be optimized is x, and the gradient value calculated during the t-th iteration of the gradient descent method is gt. The algorithm first initializes the following two vectors to zero vectors:
E[g^2]_0 = 0, \quad E[\Delta x^2]_0 = 0
Where E[g^2] is the running accumulation of squared gradients (computed separately for each component), with update formula:
E[g^2]_t = \delta \, E[g^2]_{t-1} + (1 - \delta) g_t^2
Here, g2 is the square of each element of the vector, and all subsequent calculation formulas are performed for each component of the vector. Next, the following RMS quantity is calculated:
RMS[g]_t = \sqrt{E[g^2]_t + \varepsilon}
This is also a vector, calculated separately for each component of the vector. Then, the update value of the parameters is calculated:
\Delta x_t = -\frac{RMS[\Delta x]_{t-1}}{RMS[g]_t} \, g_t
RMS[\Delta x]_{t-1} is computed analogously from the accumulation E[\Delta x^2] of squared updates. This update value is likewise built from the gradient, except that its learning rate is determined by the history of the gradient values. The iterative formula for the parameter update is:
x_{t+1} = x_t + \Delta x_t
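A sketch of the AdaDelta update (same illustrative setup); note that no global learning rate appears anywhere:

```python
# AdaDelta sketch: the step is the ratio of two running RMS values,
# so no global learning rate is needed.
import numpy as np

def grad(x):
    return 2 * x

x = np.array([3.0, -2.0])
eg = np.zeros_like(x)     # E[g^2]: running average of squared gradients
ex = np.zeros_like(x)     # E[dx^2]: running average of squared updates
delta, eps = 0.9, 1e-6
for _ in range(500):
    g = grad(x)
    eg = delta * eg + (1 - delta) * g**2
    dx = -np.sqrt(ex + eps) / np.sqrt(eg + eps) * g  # RMS[dx]_{t-1}/RMS[g]_t * g
    ex = delta * ex + (1 - delta) * dx**2
    x = x + dx
print(x)
```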
Adam Algorithm
The Adam algorithm integrates adaptive learning rates with momentum terms. The algorithm constructs two vectors m and v from the gradient, the former being the momentum term, and the latter accumulating the sum of squared gradients, used to construct adaptive learning rates. Their initial values are set to 0, and the update formulas are:
(m_t)_i = \beta_1 (m_{t-1})_i + (1 - \beta_1)(g_t)_i
(v_t)_i = \beta_2 (v_{t-1})_i + (1 - \beta_2)\left( (g_t)_i \right)^2
Where β1, β2 are manually specified parameters, and i is the index of the vector component. Relying on these two values, the parameter update value is constructed, and the parameter update formula is:
(x_{t+1})_i = (x_t)_i - \alpha \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \cdot \frac{(m_t)_i}{\sqrt{(v_t)_i} + \varepsilon}
Here, m is similar to the momentum term, and v is used to construct the learning rate.
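A sketch of the Adam update (same illustrative setup). The bias-corrected form m_hat / (sqrt(v_hat) + eps) used below is algebraically the same as the coefficient written in the formula above:

```python
# Adam sketch: m is the momentum-like first moment, v the second moment
# used to build the adaptive learning rate.
import numpy as np

def grad(x):
    return 2 * x

x = np.array([3.0, -2.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
alpha, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 301):
    g = grad(x)
    m = b1 * m + (1 - b1) * g            # first moment (momentum term)
    v = b2 * v + (1 - b2) * g**2         # second moment (squared gradients)
    m_hat = m / (1 - b1**t)              # bias corrections
    v_hat = v / (1 - b2**t)
    x -= alpha * m_hat / (np.sqrt(v_hat) + eps)
print(x)
```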
Stochastic Gradient Descent Method
Assuming there are N samples in the training sample set, the goal of the supervised learning algorithm during training is to optimize the average loss function over this dataset:
\min_w \frac{1}{N} \sum_{i=1}^{N} L(w, x_i, y_i) + \lambda r(w)
Where L(w,xi,yi) is the loss function for a single training sample (xi,yi), w is the parameter to be learned, r(w) is the regularization term, and λ is the weight of the regularization term. When the number of training samples is very large, if all samples are used in each iteration during training, the computational cost is too high. As an improvement, we can select a batch of samples in each iteration and define the loss function based on these samples.
The mini-batch stochastic gradient descent method uses a random approximation of the above objective function at each iteration: only M < N randomly selected samples are used to approximate the loss function. The objective optimized at each iteration becomes:
\frac{1}{M} \sum_{i=1}^{M} L(w, x_i, y_i) + \lambda r(w)
The stochastic gradient descent method converges in a probabilistic sense.
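A mini-batch SGD sketch on a synthetic least-squares problem (the data, the model, and all constants are my own illustration, not from the article):

```python
# Mini-batch SGD: each iteration estimates the gradient of the average
# loss from M randomly chosen samples instead of all N.
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 32                                 # dataset size, batch size
X = rng.normal(size=(N, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=N)

w = np.zeros(3)
lr = 0.1
for _ in range(2000):
    idx = rng.choice(N, size=M, replace=False)  # random mini-batch
    Xb, yb = X[idx], y[idx]
    g = 2 * Xb.T @ (Xb @ w - yb) / M            # gradient of the batch loss
    w -= lr * g
print(w)                                        # approaches w_true
```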
Newton’s Method
Newton’s method is a second-order optimization technique that utilizes both first-order and second-order derivative information to directly find points where the gradient is 0. The iterative formula for Newton’s method is:
x_{k+1} = x_k - \gamma H_k^{-1} g_k
Where H is the Hessian matrix and g is the gradient vector. Newton's method does not guarantee that the function value decreases at each iteration, nor that it converges to a minimum point. As with gradient descent, a learning rate \gamma must therefore be set in implementation so that the higher-order terms of the Taylor expansion can be ignored; it is usually determined by line search.
In implementation, we generally do not directly compute the inverse of the Hessian matrix, but instead solve the following linear equation:
H_k d = -g_k
The solution d is called the Newton direction. The iteration terminates when the gradient is sufficiently close to 0 or when the specified maximum number of iterations is reached.
Newton’s method has a faster convergence speed than the gradient descent method, but requires computing the Hessian matrix at each iteration and solving a linear equation, which is computationally intensive. Additionally, if the Hessian matrix is not invertible, this method fails.
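A sketch of the Newton iteration (the objective is an illustrative choice; a unit step is used here instead of a line search):

```python
# Newton's method sketch: solve H d = -g for the Newton direction
# rather than inverting the Hessian.
import numpy as np

def grad(x):                                   # gradient of f = x0^4 + x1^2
    return np.array([4 * x[0]**3, 2 * x[1]])

def hess(x):                                   # Hessian of the same f
    return np.array([[12 * x[0]**2, 0.0],
                     [0.0, 2.0]])

x = np.array([2.0, 1.0])
for _ in range(50):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:              # stop when gradient is near 0
        break
    d = np.linalg.solve(hess(x), -g)           # Newton direction
    x = x + d                                  # unit step (no line search)
print(x)                                       # close to the minimum at (0, 0)
```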
Newton’s method has practical applications in machine learning algorithms such as logistic regression and AdaBoost.
Quasi-Newton Methods
Newton's method requires computing the Hessian matrix at each iteration and solving a linear system with that matrix as the coefficient matrix, and the Hessian may not be invertible. To address this, improved methods known as quasi-Newton methods have been proposed; the Broyden family (of which BFGS is the best-known member) is a typical representative. The idea is not to compute the Hessian of the objective function and invert it, but to construct, from gradient information, a positive definite symmetric matrix that approximates the Hessian or its inverse, and to use that matrix in the Newton-style iteration.
All these main numerical optimization algorithms can be freely experimented with on the SIGAI Cloud Laboratory:
www.sigai.cn
By constructing the objective function and specifying the parameters and initial values of the optimization algorithm, the running process of the algorithm can be visualized, and the solving results under different parameters can be compared.
Trust Region Newton’s Method
The standard Newton's method may not converge to an optimal solution and cannot guarantee that the function value decreases along the iteration sequence. This can be remedied by adjusting the step taken along the Newton direction; the two commonly used techniques are line search and trust region methods. The trust region Newton method is a variant of the truncated Newton method, used to solve optimization problems with bound constraints. At each iteration, it maintains the current iterate x_k, a trust region size \Delta_k, and a quadratic model of the objective:
q_k(s) = \nabla f(x_k)^T s + \frac{1}{2} s^T \nabla^2 f(x_k) s
This expression can be obtained through Taylor expansion, ignoring higher-order terms:
f(x_k + s) - f(x_k) \approx q_k(s)
The algorithm seeks an s_k that approximately minimizes q_k(s) subject to the constraint \|s\| \le \Delta_k. It then checks the following ratio in order to update x_k and \Delta_k:
\rho_k = \frac{f(x_k + s_k) - f(x_k)}{q_k(s_k)}
This is the ratio of the actual decrease in function value to the decrease in function value predicted by the quadratic approximation model. Based on the previous calculation results, the size of the trust region is dynamically adjusted.
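A sketch of just this acceptance and radius-update bookkeeping (the 0.25/0.75 thresholds and the halving/doubling factors are common conventions, not prescribed by the article):

```python
# Trust region bookkeeping sketch. x and s are NumPy arrays; f evaluates
# the objective and q evaluates the quadratic model q_k of the change.
def update_trust_region(x, s, f, q, radius):
    rho = (f(x + s) - f(x)) / q(s)  # actual decrease / predicted decrease
    if rho > 0:                     # the step really decreased f: accept it
        x = x + s
    if rho < 0.25:                  # model fits poorly: shrink the region
        radius *= 0.5
    elif rho > 0.75:                # model fits well: allow a larger region
        radius *= 2.0
    return x, radius
```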
The trust region Newton method has practical applications in logistic regression and linear support vector machine training; for details, see the liblinear open-source library.
Divide and Conquer Method
The divide and conquer method is an algorithm design idea that decomposes a large problem into smaller subproblems for solving. The solution to the entire problem is constructed based on the solutions to the subproblems. In optimization methods, the specific approach is to adjust only part of the components of the optimization vector during each iteration while keeping the other components fixed.
Coordinate Descent Method
The basic idea of the coordinate descent method is to optimize one variable at a time; it is a form of divide and conquer. Suppose the optimization problem to be solved is:
\min_x f(x), \quad x = (x_1, x_2, \dots, x_n)
Coordinate descent selects one component x_i to optimize while keeping the other components fixed, converting the extreme value problem of a multivariable function into a sequence of univariate extreme value problems. When the problem is large in scale, this approach can speed up the solution considerably.
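A sketch of coordinate descent on an illustrative convex quadratic, where each one-dimensional minimization has a closed form:

```python
# Coordinate descent sketch: sweep over coordinates, minimizing f exactly
# in one variable at a time (problem data is illustrative).
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # f(x) = 0.5 x^T A x - b^T x, A positive definite
b = np.array([1.0, 1.0])

x = np.zeros(2)
for _ in range(50):
    for i in range(len(x)):
        # Setting df/dx_i = A[i] @ x - b[i] to 0 and solving for x[i]:
        x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
print(x, np.linalg.solve(A, b))     # both approach the same solution
```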
The coordinate descent method has practical applications in logistic regression and linear support vector machine training; for details, see the liblinear open-source library.
SMO Algorithm
The SMO algorithm is also a divide and conquer method, used to solve the dual problem of support vector machines. The dual problem with slack variables and kernel functions is:
\max_\alpha \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \; \sum_{i=1}^{N} \alpha_i y_i = 0
The core idea of the SMO algorithm is to optimize two components αi and αj from the optimization variables at a time while keeping the other components fixed, ensuring that the equality constraint conditions are satisfied. The reason for choosing to optimize two variables instead of one is that there are equality constraints; adjusting only one variable would violate the equality constraint.
Assume the two selected components are \alpha_i and \alpha_j, with the other components fixed as constants. The objective restricted to these two variables is a bivariate quadratic function subject to the equality and inequality constraints. Because the equality constraint lets one variable be expressed through the other, the subproblem reduces to minimizing a univariate quadratic function over an interval, whose exact solution can be written down directly.
Stage-wise Optimization
Stage-wise optimization fixes one part a of the optimization variables and optimizes the other part b at each iteration, then fixes b and optimizes a, alternating in this way until convergence to the optimal solution.
The AdaBoost algorithm is a typical representative of this method. The AdaBoost algorithm uses an exponential loss function during training:
L(y, F(x)) = e^{-y F(x)}
Since the strong classifier is a weighted sum of multiple weak classifiers, substituting the above loss function gives the objective function to be optimized during algorithm training:
(\beta_j, f_j) = \arg\min_{\beta, f} \sum_{i=1}^{N} \exp\left( -y_i \left( F_{j-1}(x_i) + \beta f(x_i) \right) \right)
Here, the exponential loss is split into two parts: the existing strong classifier F_{j-1}, and the loss of the current weak classifier f on the training samples. The former was already determined in previous iterations and can therefore be treated as a constant. The objective function thus simplifies to:
(\beta_j, f_j) = \arg\min_{\beta, f} \sum_{i=1}^{N} w_i^{j-1} \exp\left( -\beta y_i f(x_i) \right)
Where:
w_i^{j-1} = \exp\left( -y_i F_{j-1}(x_i) \right)
This problem is solved in two steps: first treat the weak classifier weight \beta as a constant and find the optimal weak classifier f; once f is obtained, optimize the weight coefficient \beta.
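For the standard discrete AdaBoost setting (weak classifiers with \pm 1 outputs), this two-step procedure has a well-known closed-form answer for the second step, which the article does not spell out: if err_j denotes the weighted error rate of the chosen weak classifier f_j under the weights w_i^{j-1}, the optimal weight is

\beta_j = \frac{1}{2} \ln \frac{1 - \text{err}_j}{\text{err}_j}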
Dynamic Programming Algorithm
Dynamic programming is also a solving idea that decomposes a problem into subproblems. Its key property is that if a solution to the entire problem is optimal, then every part of that solution is also optimal for the corresponding subproblem. Solving the subproblems optimally and extending the solutions step by step yields the optimal solution of the entire problem.
Typical representatives of this type of method are the decoding algorithm of hidden Markov models (the Viterbi algorithm) and the dynamic programming algorithms of reinforcement learning. Such algorithms generally optimize over discrete variables and are combinatorial optimization problems; the derivative-based optimization algorithms discussed earlier cannot be used on them. Dynamic programming solves such problems efficiently, resting on the Bellman optimality principle: once the problem is written as a recursive optimality equation, an algorithm can be constructed to solve it.
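A minimal Viterbi sketch for a two-state HMM (the toy model parameters are my own illustration, not from the article):

```python
# Viterbi sketch: dp[s] holds the best log-probability of any state path
# ending in state s after the observations seen so far.
import numpy as np

start = np.log([0.6, 0.4])                # initial state distribution
trans = np.log([[0.7, 0.3],               # transition probabilities
                [0.4, 0.6]])
emit = np.log([[0.9, 0.1],                # emission probabilities
               [0.2, 0.8]])
obs = [0, 1, 1]                           # observed symbol sequence

dp = start + emit[:, obs[0]]
back = []
for o in obs[1:]:
    scores = dp[:, None] + trans          # scores[i, j]: end in i, move to j
    back.append(scores.argmax(axis=0))    # best predecessor of each state
    dp = scores.max(axis=0) + emit[:, o]

path = [int(dp.argmax())]                 # backtrack from the best final state
for bp in reversed(back):
    path.append(int(bp[path[-1]]))
path.reverse()
print(path)                               # most probable state sequence
```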