Recommended: Illustrated Guide to the 10 Most Common Machine Learning Algorithms!

Reprinted. Author: james_aka_yale

In machine learning there is a result known as the “No Free Lunch” theorem. In short, it states that no single algorithm performs best on every problem. It is especially relevant to supervised learning.

For example, you cannot say that neural networks are always better than decision trees, or vice versa. The performance of a model is influenced by many factors, such as the size and structure of the dataset.

Therefore, you should try several different algorithms on your problem, using a held-out test set to evaluate their performance and select the best one.

Of course, the algorithms you try must be appropriate for your problem, which is where choosing the right machine learning task comes in. By analogy, if you need to clean your house, you might use a vacuum cleaner, a broom, or a mop, but you would not pick up a shovel and start digging.

For those eager to understand the basics of machine learning, here is a list of the top ten machine learning algorithms used by data scientists, introducing the characteristics of these algorithms to help everyone better understand and apply them. Come take a look!

01 Linear Regression

Linear regression is perhaps one of the most well-known and easily understood algorithms in statistics and machine learning.

Predictive modeling is primarily concerned with minimizing a model's error, or making the most accurate predictions possible, at the expense of interpretability. To that end, we borrow, reuse, and steal algorithms from many different fields, including statistics.

Linear regression is represented by an equation that describes the linear relationship between input variables (x) and output variables (y) by finding specific weights (B) for the input variables.

Example: y = B0 + B1 * x

Given input x, we will predict y. The goal of the linear regression learning algorithm is to find the values of coefficients B0 and B1.

Different techniques can be used to learn the linear regression model from data, such as linear algebra solutions for ordinary least squares and gradient descent optimization.
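As a rough illustration of the ordinary least squares route, the sketch below fits B0 and B1 for a single input variable using the closed-form solution. The function names and the toy data (generated from y = 1 + 2x) are assumptions made for this example only.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) that minimize squared error for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1


def predict(b0, b1, x):
    """Predict y for a new input x."""
    return b0 + b1 * x


# Hypothetical toy data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
b0, b1 = fit_simple_linear_regression(xs, ys)
print(b0, b1, predict(b0, b1, 4.0))
```

With more than one input variable, the same idea is usually expressed with matrices or solved iteratively with gradient descent.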

Linear regression has been around for more than 200 years and has been studied extensively. Two useful rules of thumb when using this technique are to remove variables that are very similar (correlated) and, where possible, to remove noise from the data. It is a fast, simple technique and a good first algorithm to try.

02 Logistic Regression

Logistic regression is another technique borrowed from the statistical field for machine learning. It is a specialized method for binary classification problems (problems with two class values).

Logistic regression is similar to linear regression, as both aim to find the weight values for each input variable. Unlike linear regression, however, the predicted output values are transformed using a nonlinear function called the logistic function.

The logistic function looks like a large S and maps any value into the range between 0 and 1. This is useful because we can apply a rule to the output of the logistic function to snap values to 0 and 1 and so predict a class value (for example, if the output is less than 0.5, predict class 0; otherwise predict class 1).

Because of the way the model is learned, the predictions made by logistic regression can also be used as the probability that a given instance belongs to class 0 or class 1. This is useful for problems where you need to give more of a rationale for a prediction.
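The following sketch shows the logistic function and a thresholding rule for a single input variable. The coefficients are made up for illustration rather than learned from data, and the rule follows the common convention that a probability of at least 0.5 maps to class 1.

```python
import math


def logistic(z):
    """Map any real number into the range (0, 1), avoiding overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_class(b0, b1, x, threshold=0.5):
    """Return (class, probability of class 1) for input x."""
    p = logistic(b0 + b1 * x)
    return (1 if p >= threshold else 0), p


# Hypothetical coefficients, chosen only for this example.
B0, B1 = -3.0, 1.5
for x in (1.0, 3.0):
    label, p = predict_class(B0, B1, x)
    print(x, label, round(p, 3))
```

The second returned value is the probability estimate described above, which can be reported alongside the predicted class.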

Like linear regression, logistic regression works better when you remove attributes that are unrelated to the output variable as well as attributes that are very similar (correlated) to each other. It is a fast model to learn and is effective on binary classification problems.

03 Linear Discriminant Analysis

Traditional logistic regression is limited to binary classification problems. If you have more than two classes, then Linear Discriminant Analysis (LDA) is the preferred linear classification technique.

LDA is represented very simply. It consists of the statistical properties of your data, calculated according to each class. For a single input variable, this includes:

Mean for each class.
Variance calculated across all classes.

Predictions are made by calculating a discriminant value for each class and predicting the class with the largest value. The technique assumes that the data has a Gaussian distribution (bell curve), so it is a good idea to remove outliers from the data beforehand. It is a simple yet powerful method for classification predictive modeling problems.
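A minimal sketch of LDA for a single input variable is given below. It uses the standard discriminant for classes that share one variance, D_k(x) = x * mean_k / variance - mean_k^2 / (2 * variance) + ln(prior_k). That formula is not spelled out in the text above, and the function names and toy data are assumptions for illustration.

```python
import math
from collections import defaultdict


def fit_lda_1d(xs, labels):
    """Estimate per-class means, class priors, and one pooled variance."""
    groups = defaultdict(list)
    for x, label in zip(xs, labels):
        groups[label].append(x)
    n, k = len(xs), len(groups)
    means = {c: sum(v) / len(v) for c, v in groups.items()}
    priors = {c: len(v) / n for c, v in groups.items()}
    # Variance is pooled across all classes around each class's own mean.
    pooled_var = sum(
        (x - means[c]) ** 2 for c, v in groups.items() for x in v
    ) / (n - k)
    return means, priors, pooled_var


def predict_lda_1d(model, x):
    """Return the class with the largest discriminant value."""
    means, priors, var = model
    scores = {
        c: x * means[c] / var - means[c] ** 2 / (2 * var) + math.log(priors[c])
        for c in means
    }
    return max(scores, key=scores.get)


# Hypothetical three-class data with one input variable.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0, 11.0, 12.0, 13.0]
labels = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
model = fit_lda_1d(xs, labels)
print(predict_lda_1d(model, 2.5), predict_lda_1d(model, 7.2))
```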

04 Classification and Regression Trees

Decision trees are an important algorithm in machine learning.

The decision tree model is represented as a binary tree. This is the same binary tree you know from algorithms and data structures, nothing special. Each node represents a single input variable (x) and a split point on that variable (assuming the variable is numeric).

The leaf nodes of the tree contain output variables (y) used for making predictions. Predictions are made by traversing the tree, stopping when a leaf node is reached, and outputting the class value of that leaf node.
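To make the traversal concrete, here is a small sketch of prediction with a hand-built tree. The Node class, the split values, and the class names are all hypothetical, and learning the splits from data is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    feature: Optional[int] = None      # index of the input variable to test
    threshold: Optional[float] = None  # split point on that variable
    left: Optional["Node"] = None      # followed when value < threshold
    right: Optional["Node"] = None     # followed when value >= threshold
    value: Optional[str] = None        # class value, set only on leaf nodes


def predict(node, row):
    """Walk from the root to a leaf and return the leaf's class value."""
    while node.value is None:
        if row[node.feature] < node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value


# Hypothetical tree: split on variable 0 first, then on variable 1.
tree = Node(
    feature=0,
    threshold=5.0,
    left=Node(value="A"),
    right=Node(
        feature=1,
        threshold=2.0,
        left=Node(value="B"),
        right=Node(value="C"),
    ),
)
print(predict(tree, [3.0, 9.0]), predict(tree, [6.0, 1.0]), predict(tree, [6.0, 2.5]))
```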

Decision trees learn quickly and predict quickly. They often predict accurately for many problems, and you do not need to do any special preparation for the data.

05 Naive Bayes

Naive Bayes is a simple yet extremely powerful predictive modeling algorithm.

The model consists of two types of probabilities that can be directly calculated from your training data: 1) the probability of each class; 2) the conditional probability of each x value given each class. Once calculated, the probability model can be used to make predictions on new data using Bayes’ theorem. When your data is numerical, it is often assumed to follow a Gaussian distribution (bell curve) so that these probabilities can be easily estimated.

Naive Bayes is called naive because it assumes that each input variable is independent. This is a strong assumption that is unrealistic for real data, but the technique is still very effective for a wide range of complex problems.
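The two kinds of probability and the independence assumption can be sketched as a small Gaussian Naive Bayes classifier. The function names and toy data are assumptions for illustration; the sketch also assumes every class has at least two training rows and non-zero variance in each column.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Store the class prior and per-column (mean, variance) for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        prior = len(members) / len(rows)
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / (len(column) - 1)
            stats.append((mean, var))
        model[label] = (prior, stats)
    return model


def log_gaussian(x, mean, var):
    """Log of the Gaussian density, used as the conditional probability."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict_nb(model, row):
    """Pick the class with the largest log prior plus summed log likelihoods."""
    def score(label):
        prior, stats = model[label]
        # Summing per-variable terms is exactly the "naive" independence step.
        return math.log(prior) + sum(
            log_gaussian(x, m, v) for x, (m, v) in zip(row, stats)
        )
    return max(model, key=score)


# Hypothetical two-class data with two numeric input variables.
rows = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [4.0, 5.0], [4.2, 4.8], [3.8, 5.2]]
labels = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(rows, labels)
print(predict_nb(model, [1.1, 2.1]), predict_nb(model, [4.1, 5.0]))
```

Working in log space avoids multiplying many small probabilities together, which is a common practical choice rather than something the description above requires.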

06 K-Nearest Neighbors

The KNN algorithm is very simple and very effective. The KNN model is represented by the entire training dataset. Isn’t that simple?

To make a prediction for a new data point, KNN searches the entire training set for the K most similar instances (neighbors) and summarizes the output variable of those K instances. For regression problems, the prediction may be the mean of the neighbors' output values, while for classification problems it may be the mode (the most common class value).

The key to success lies in how you determine the similarity between data instances. If your attributes are all on the same scale, the simplest way is to use Euclidean distance, which can be calculated directly from the differences between each input variable.
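A brute-force sketch of KNN with Euclidean distance follows, covering both the classification (mode) and regression (mean) cases. The function names and toy data are assumptions for illustration, and the inputs are assumed to be on the same scale already.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Square root of the summed squared differences per input variable."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def k_nearest(train, query, k):
    """Return the k (features, output) pairs closest to the query."""
    return sorted(train, key=lambda pair: euclidean(pair[0], query))[:k]


def knn_classify(train, query, k=3):
    """Predict the most common class among the k nearest neighbors."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Predict the mean output of the k nearest neighbors."""
    neighbors = k_nearest(train, query, k)
    return sum(target for _, target in neighbors) / len(neighbors)


# Hypothetical training sets: the "model" is simply the stored data.
train_cls = [
    ((1.0, 1.0), "red"), ((1.0, 2.0), "red"), ((2.0, 1.0), "red"),
    ((6.0, 6.0), "blue"), ((7.0, 6.0), "blue"), ((6.0, 7.0), "blue"),
]
train_reg = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

print(knn_classify(train_cls, (2.0, 2.0)))
print(knn_regress(train_reg, (2.1,)))
```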

KNN may require a lot of memory or space to store all the data, but it only performs computations (or learning) when predictions are needed. You can also update and manage your training set at any time to maintain prediction accuracy.

The concept of distance or closeness can break down in high dimensions (with many input variables), which can hurt the algorithm's performance. This phenomenon is known as the curse of dimensionality. It suggests that you should use only those input variables that are most relevant to predicting the output variable.

07 Learning Vector Quantization

A downside of KNN is that you need to hang on to the entire training dataset. The Learning Vector Quantization algorithm (LVQ for short) is an artificial neural network algorithm that lets you choose how many training instances to keep and learns exactly what those instances should look like.

LVQ is represented by a set of codebook vectors. Initially, vectors are randomly selected and then iteratively adapted to the training dataset. After learning, the codebook vectors can be used to predict like K-Nearest Neighbors. The most similar neighbor (best match) is found by calculating the distance between each codebook vector and the new data instance, and then returning the class value of the best matching unit or the actual value in the case of regression as the prediction. You can achieve the best results if you keep the data within the same range (e.g., between 0 and 1).
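The sketch below illustrates one way the codebook vectors might be adapted and then used for prediction. The update rule is the commonly described LVQ1 rule (move the best matching unit toward an instance of the same class and away from an instance of a different class), which is an assumption here since the text does not give the rule. The learning rate schedule, function names, and data are likewise illustrative.

```python
import math
import random


def distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def best_matching_unit(codebook, features):
    """Return the codebook entry [vector, label] closest to the features."""
    return min(codebook, key=lambda unit: distance(unit[0], features))


def train_lvq(train, n_codebooks, epochs=20, lr=0.3, seed=0):
    """Adapt randomly chosen codebook vectors to the training data."""
    rng = random.Random(seed)
    # Start from randomly selected training instances.
    codebook = [[list(f), label] for f, label in rng.sample(train, n_codebooks)]
    for epoch in range(epochs):
        rate = lr * (1.0 - epoch / epochs)  # assumed linear decay
        for features, label in train:
            unit = best_matching_unit(codebook, features)
            # Attract on a class match, repel on a mismatch (LVQ1 rule).
            sign = 1.0 if unit[1] == label else -1.0
            unit[0] = [w + sign * rate * (x - w) for w, x in zip(unit[0], features)]
    return codebook


def predict_lvq(codebook, features):
    """Predict the class of the best matching unit, as in 1-nearest neighbor."""
    return best_matching_unit(codebook, features)[1]


# Hypothetical data already scaled to the range 0..1.
train = [
    ([0.1, 0.2], "a"), ([0.2, 0.1], "a"), ([0.15, 0.15], "a"),
    ([0.8, 0.9], "b"), ([0.9, 0.8], "b"), ([0.85, 0.85], "b"),
]
codebook = train_lvq(train, n_codebooks=2)
print(len(codebook), predict_lvq(codebook, [0.1, 0.1]))
```

Only the codebook needs to be stored after training, which is where the memory saving over KNN comes from.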

If you find that KNN gives good results on your dataset, try using LVQ to reduce the memory requirements of storing the entire training dataset.

08 Support Vector Machines

Support Vector Machines may be one of the most popular and discussed machine learning algorithms.

A hyperplane is a line that divides the space of input variables. In SVM, a hyperplane is selected to best separate the points in the input variable space by their class (class 0 or class 1). In two dimensions you can visualize it as a line, and for now we assume that all input points can be completely separated by that line. The SVM learning algorithm finds the coefficients that allow the hyperplane to best separate the classes.

The distance between the hyperplane and the nearest data points is called the margin, and the hyperplane with the maximum margin is the best choice. Moreover, only the nearest data points are relevant to the definition of the hyperplane and the construction of the classifier; these points are called support vectors, as they support or define the hyperplane. In practice, we use optimization algorithms to find the coefficient values that maximize the margin.
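The sketch below does not solve the SVM optimization problem. It only shows how the margin of a given hyperplane w·x + b = 0 can be measured and which points would act as support vectors, by comparing two hand-picked candidate hyperplanes. The data and coefficients are hypothetical.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, points):
    """Distance from the hyperplane to the nearest data point."""
    return min(abs(signed_distance(w, b, x)) for x in points)


def support_vectors(w, b, points, tol=1e-9):
    """Points that lie exactly at the margin distance from the hyperplane."""
    m = margin(w, b, points)
    return [x for x in points if abs(abs(signed_distance(w, b, x)) - m) <= tol]


# Hypothetical separable data: the first two points are class 0, the rest class 1.
points = [(1.0, 1.0), (2.0, 0.0), (4.0, 4.0), (5.0, 3.0)]

diagonal = ((1.0, 1.0), -5.0)   # candidate line x1 + x2 = 5
vertical = ((1.0, 0.0), -3.0)   # candidate line x1 = 3

print(margin(*diagonal, points), margin(*vertical, points))
print(support_vectors(*vertical, points))
```

Both candidate lines separate the two classes, but the diagonal one has the larger margin, so it would be preferred under the maximum-margin criterion.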

SVM may be one of the most powerful off-the-shelf classifiers and is worth trying on your dataset.

09 Bagging and Random Forest

Random Forest is one of the most popular and powerful machine learning algorithms. It is a type of ensemble machine learning algorithm called Bootstrap Aggregation, or bagging.

The bootstrap is a powerful statistical method for estimating a quantity, such as the mean, from a data sample. You draw many samples from your data (resampling with replacement), calculate the mean of each, and then average all of those means to get a better estimate of the true mean.
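As a small illustration, the sketch below estimates a mean with the bootstrap by resampling with replacement. The function name, the number of resamples, the seed, and the data are assumptions for the example.

```python
import random


def bootstrap_mean(sample, n_resamples=1000, seed=0):
    """Average the means of many resamples drawn with replacement."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Each resample has the same size as the original sample.
        resample = [rng.choice(sample) for _ in sample]
        means.append(sum(resample) / len(resample))
    return sum(means) / len(means)


# Hypothetical data sample.
sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
estimate = bootstrap_mean(sample)
print(estimate)
```

The spread of the individual resample means can also be used to gauge how uncertain the estimate is, although this sketch only returns their average.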

Bagging uses the same approach, but for estimating entire statistical models, most commonly decision trees, rather than a single quantity. Multiple samples of the training data are drawn, and a model is built for each sample. When you need a prediction for new data, each model makes a prediction and the predictions are averaged to give a better estimate of the true output value.
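Here is a rough sketch of bagging for regression with one input variable. The base model is a one-split regression tree (a "stump"), chosen only to keep the example short; the function names, the number of models, and the data are assumptions.

```python
import random


def fit_stump(points):
    """Fit a one-split regression tree to (x, y) pairs; return a predictor."""
    xs = sorted({x for x, _ in points})
    overall = sum(y for _, y in points) / len(points)
    best = (float("inf"), None, overall, overall)
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in points if x < t]
        right = [y for x, y in points if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    # With no valid split (all x equal), fall back to the overall mean.
    return lambda x: lm if (t is None or x < t) else rm


def bagged_predict(points, x, n_models=25, seed=0):
    """Train one stump per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        resample = [rng.choice(points) for _ in points]
        predictions.append(fit_stump(resample)(x))
    return sum(predictions) / len(predictions)


# Hypothetical data with a low group and a high group.
points = [(1.0, 1.0), (2.0, 1.2), (3.0, 0.9), (7.0, 5.1), (8.0, 4.8), (9.0, 5.0)]
print(bagged_predict(points, 1.0), bagged_predict(points, 9.0))
```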

Random Forest is a tweak on this approach: the decision trees are built so that, rather than always selecting the optimal split point, they make suboptimal splits by introducing randomness.

The models created for each data sample are therefore more different from one another than they otherwise would be, yet each is still accurate in its own way. Combining their predictions gives a better estimate of the true underlying output value.

If you get good results with a high-variance algorithm (such as decision trees), you can often get better results by bagging that algorithm.

10 Boosting and AdaBoost

Boosting is an ensemble technique that creates a strong classifier from several weak classifiers. It first builds a model from the training data, then creates a second model that tries to correct the errors of the first. Models are added until the training set is predicted perfectly or a maximum number of models has been reached.

AdaBoost was the first really successful boosting algorithm developed for binary classification, and it is the best starting point for understanding boosting. Modern boosting methods build on AdaBoost, most notably stochastic gradient boosting machines.

AdaBoost is often used with short decision trees. After the first tree is created, its performance on each training instance determines how much attention the next tree should pay to that instance. Training instances that are hard to predict are given more weight, while instances that are easy to predict are given less. The models are created sequentially, one after another, and each one updates the instance weights that affect the learning of the next tree in the sequence. Once all the trees have been built, predictions are made for new data, and each tree's contribution is weighted by how accurate it was on the training data.
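The reweighting loop can be sketched with the classic discrete AdaBoost update on one input variable, using one-split trees (stumps) as the weak classifiers and labels of +1 and -1. The function names, the toy data, and the fixed number of rounds are assumptions for illustration.

```python
import math


def stump_predict(threshold, polarity, x):
    """Predict `polarity` above the threshold and `-polarity` otherwise."""
    return polarity if x > threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in thresholds:
        for polarity in (1, -1):
            err = sum(
                w for w, x, y in zip(weights, xs, ys)
                if stump_predict(t, polarity, x) != y
            )
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best


def fit_adaboost(xs, ys, rounds=3):
    """Build stumps in sequence, upweighting misclassified instances."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, t, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # more accurate, larger say
        ensemble.append((alpha, t, polarity))
        # Hard instances (wrongly predicted) gain weight, easy ones lose it.
        weights = [
            w * math.exp(-alpha * y * stump_predict(t, polarity, x))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict_adaboost(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(t, p, x) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical data that no single stump can classify perfectly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = fit_adaboost(xs, ys, rounds=3)
print([predict_adaboost(ensemble, x) for x in xs])
```

On this toy data a single stump misclassifies at least three points, while the weighted vote of three stumps reproduces every training label.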

Because the algorithm places great emphasis on error correction, having clean data without outliers is very important.

A typical question posed by beginners when faced with various machine learning algorithms is, “Which algorithm should I use?” The answer to this question depends on many factors, including:

The size, quality, and nature of the data;
The available computation time;
The urgency of the task;
What you want to do with the data.

Even an experienced data scientist cannot know which algorithm will perform best before trying different ones. While there are many other machine learning algorithms, these are the most popular ones. If you are a newcomer to machine learning, this is a great starting point for learning.
