Intuitive Explanation of Logistic Regression by Andrew Ng


Deep learning is a major branch of machine learning that attempts to model high-level abstractions of data using algorithms composed of multiple processing layers with complex structures or multiple nonlinear transformations.

Logistic regression (also translated as “log-odds regression”) is one of the discrete choice models and belongs to the category of multivariate analysis. It is a common method for statistical empirical analysis in sociology, biostatistics, clinical research, quantitative psychology, econometrics, marketing, and other fields.

1. Symbol Convention


Logistic regression is generally used for binary classification problems: given some input, the output is a discrete value. For example, to implement a cat classifier with logistic regression, the input is an image x and the output ŷ is the probability that the image contains a cat.

An image is a type of unstructured data. When stored in a computer with RGB encoding, an image is represented using red, green, and blue as the three primary colors: each pixel stores an intensity value for each of the three colors, so the image forms a matrix with three channels, each the same size as the image. For example, if the cat image is 64×64 pixels, each channel is a 64×64 matrix.

Pattern recognition and the various kinds of data processed in machine learning need to be represented as feature vectors. To represent the image in the example above as a feature vector x, the pixel values of the three channels are unrolled and stacked into a single vector of dimension n_x = 64×64×3 = 12288.
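As a concrete sketch of this flattening step (assuming NumPy; the random image and its shapes are made up to match the 64×64 example above):

```python
import numpy as np

# A hypothetical 64x64 RGB image: three channels of pixel intensities in [0, 255].
image = np.random.randint(0, 256, size=(64, 64, 3))

# Unroll all pixel values of all three channels into a single column vector.
x = image.reshape(-1, 1)

print(x.shape)  # (12288, 1), since 64 * 64 * 3 = 12288
```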


2. Logistic Regression

Logistic regression is a learning algorithm for supervised learning problems; its goal is to minimize the error between the predicted values and the label values of the training data.

Given an input feature vector x ∈ ℝ^(n_x), logistic regression computes its prediction from parameters w ∈ ℝ^(n_x) and b ∈ ℝ as

ŷ = σ(wᵀx + b),

where σ is the sigmoid function

σ(z) = 1 / (1 + e^(−z))

The function graph is:

(Figure: the S-shaped curve of the sigmoid function, which maps any real input into the interval (0, 1), with σ(0) = 0.5, σ(z) → 1 as z → +∞, and σ(z) → 0 as z → −∞.)
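The sigmoid can be written in a few lines of NumPy (the function itself is standard; the sample inputs below are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Map any real input (scalar or array) into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))  # 0.5, the midpoint of the S-curve
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # values near 0, at 0.5, and near 1
```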

3. Cost Function

The loss function measures the discrepancy between the prediction ŷ and the true label y on a single training example. An intuitive candidate is the squared error:

L(ŷ, y) = (1/2)(ŷ − y)²

However, this loss function is generally not used in logistic regression, because with it the training objective becomes non-convex and has many local optima, so gradient descent may fail to reach the global optimum. For the logistic regression model, we instead want ŷ to represent the conditional probability of the label given the input:

p(y|x) = ŷ^y (1 − ŷ)^(1−y)

which equals ŷ when y = 1 and 1 − ŷ when y = 0.

Since p(y|x) should be maximized while a loss function should be minimized, negating the expression lets it serve as a loss. Taking the logarithm of the right-hand side (which does not change where the maximum is attained) and adding a negative sign yields the log loss used in logistic regression:

L(ŷ, y) = −(y log ŷ + (1 − y) log(1 − ŷ))

The overall cost function for m training samples can be derived using the method of Maximum Likelihood Estimation.

Assuming all training samples are independent and identically distributed, the joint probability is the product of the probabilities of all samples:

P = ∏_{i=1}^{m} p(y^(i) | x^(i))

Taking the logarithm turns the product into a sum, and maximizing the log-likelihood is equivalent to minimizing the average per-example loss, which gives the cost function:

J(w, b) = (1/m) ∑_{i=1}^{m} L(ŷ^(i), y^(i))
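This cost can be computed directly; the sketch below assumes NumPy and the course's row-vector convention of shape (1, m) for labels and predictions, with toy values made up for illustration:

```python
import numpy as np

def cost(y_hat, y):
    """Average log loss over m training examples.

    y_hat: predicted probabilities, shape (1, m)
    y:     labels in {0, 1},        shape (1, m)
    """
    m = y.shape[1]
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return np.sum(losses) / m

y = np.array([[1, 0, 1]])
y_hat = np.array([[0.9, 0.1, 0.8]])
print(cost(y_hat, y))  # small, since the predictions agree with the labels
```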

4. Gradient Descent

To find the values of parameters w and b that minimize the cost function, the gradient descent method is generally used. The gradient at a point in a scalar field points in the direction of the fastest increase of the scalar field, and the length of the gradient is the maximum rate of change.

(Figure: the convex, bowl-shaped surface of J(w, b) plotted over the (w, b) plane, with the initial point marked by a small red dot.)

Plotting the three-dimensional surface of the cost function J(w, b) with w and b as axes shows that the function is convex. To find suitable parameters, w and b are given initial values, shown as the small red dot in the figure.

In logistic regression almost any initialization works, and the parameters are usually initialized to zero. Random initialization also works, but it is generally unnecessary: because the cost function is convex, gradient descent converges to the same point (or nearly the same point) regardless of the initial value.

Gradient descent starts from the initial point and repeatedly steps in the steepest downhill direction, i.e. opposite to the gradient at the current point, so as to reach the lowest point as quickly as possible.

(Figure: a two-dimensional cross-section of J, with gradient descent steps moving toward the minimum.)

In the two-dimensional picture, stepping opposite to the derivative gives the fastest descent; mathematically, each iteration performs the update:

w := w − α · ∂J(w, b)/∂w
b := b − α · ∂J(w, b)/∂b

where the learning rate α controls the step size of each iteration.
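Putting the pieces together, the following is a minimal sketch of batch gradient descent for logistic regression, using the vectorized shapes X ∈ ℝ^(n_x × m) and Y ∈ ℝ^(1 × m); the toy dataset and hyperparameters are illustrative assumptions, not from the original course code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, alpha=0.1, iterations=1000):
    """Fit logistic regression by batch gradient descent.

    X: inputs, shape (n_x, m); Y: labels in {0, 1}, shape (1, m).
    """
    n_x, m = X.shape
    w = np.zeros((n_x, 1))  # zero initialization is fine: J is convex
    b = 0.0
    for _ in range(iterations):
        Y_hat = sigmoid(w.T @ X + b)   # forward pass, shape (1, m)
        dZ = Y_hat - Y                 # derivative of the log loss w.r.t. z
        dw = (X @ dZ.T) / m            # dJ/dw
        db = np.sum(dZ) / m            # dJ/db
        w -= alpha * dw                # step opposite the gradient
        b -= alpha * db
    return w, b

# Toy data: the label is 1 exactly when the single feature is positive.
X = np.array([[-2.0, -1.0, 1.0, 2.0]])
Y = np.array([[0, 0, 1, 1]])
w, b = train(X, Y)
predictions = sigmoid(w.T @ X + b) > 0.5
print(predictions)  # [[False False  True  True]]
```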

5. References
  1. Andrew Ng – Neural Networks and Deep Learning – NetEase Cloud Classroom

  2. Andrew Ng – Neural Networks and Deep Learning – Coursera

  3. deeplearning.ai

  4. Course Code and Materials – GitHub

Note: The images and materials in this article are compiled and translated from Andrew Ng’s Deep Learning series of courses, whose copyright belongs to him. The translation and organization are of limited quality; any errors are welcome to be pointed out.


Recommended Reading:

[Mathematical Foundation of Machine Learning] Animated Explanation of Taylor Series (Part 2)

[Mathematical Foundation of Machine Learning] Animated Explanation of Taylor Series (Part 1)

[Basic Mathematics] Understanding the Essence of Taylor Expansion?


