This article introduces the top ten most popular machine learning algorithms.
Machine learning is an innovative and important field in industry. Which algorithm we choose for a machine learning program depends on the goal we want to achieve.
Machine learning offers a great many algorithms, and so many choices can easily overwhelm a beginner. Today, we will briefly introduce 10 of the most popular machine learning algorithms, so that you can find your way into this exciting world of machine learning!
Let’s dive in!
01. Linear Regression
Linear Regression is perhaps the most popular machine learning algorithm. Linear regression aims to find a line that best fits the data points in a scatter plot. It tries to represent the independent variable (x values) and the numerical outcome (y values) by fitting the line equation to the data. This line can then be used to predict future values!
The technique most commonly used with this algorithm is the method of least squares. It computes the best-fitting line by minimizing the vertical distances from the data points to the line. The total error is the sum of the squares of these vertical distances, and the idea is to fit the model by making this squared error as small as possible.
For example, simple linear regression has one independent variable (x-axis) and one dependent variable (y-axis).
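As a quick illustration, here is a minimal least-squares fit in Python with scikit-learn. The library and the data points are our own choices for this sketch, not part of the original article:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: one independent variable x, one numerical outcome y
x = np.array([[1], [2], [3], [4], [5]])  # shape (n_samples, 1)
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

model = LinearRegression()
model.fit(x, y)  # finds the least-squares line y = a*x + b

print(model.coef_[0], model.intercept_)  # slope a and intercept b
print(model.predict([[6]]))              # predict a future value
```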
02. Logistic Regression
Logistic Regression is similar to Linear Regression, but it is used when the output is binary (i.e., when the result can take only two possible values). The final prediction is obtained by passing the output through a non-linear S-shaped function called the logistic function, g().
This logistic function maps the intermediate result values to the outcome variable Y, which ranges from 0 to 1. These values can then be interpreted as the probability of Y occurring. The properties of the S-shaped logistic function make logistic regression more suitable for classification tasks.
The logistic regression curve graph shows the relationship between the probability of passing an exam and the study time.
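To make this concrete, here is a small sketch with scikit-learn's LogisticRegression, using invented study-time data that mirrors the caption above; none of this code appears in the original article:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. whether the exam was passed (1) or not (0)
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(hours, passed)

# predict_proba passes the linear score through the S-shaped logistic function,
# producing probabilities between 0 and 1
print(clf.predict_proba([[2.2]]))  # [[P(fail), P(pass)]]
```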
03. Decision Trees
Decision Trees can be used for both regression and classification tasks.
In this algorithm, the model learns to predict the value of the target variable by learning decision rules inferred from the data features and represented as a tree. Each internal node of the tree tests one attribute.
At each node, we ask questions about the data based on the available features. The left and right branches represent possible answers. The final nodes (i.e., leaf nodes) correspond to a predicted value.
The importance of each feature is determined by a top-down approach. The higher the node, the more important its attribute.
An example of a decision tree determining whether to wait at a restaurant.
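A minimal sketch with scikit-learn's DecisionTreeClassifier makes the learned rules visible; we use the classic iris dataset here instead of the restaurant example, which is our own substitution:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree so the printed rules stay readable
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Each node asks a question about a feature, the branches are the possible
# answers, and the leaves hold the predicted class
print(export_text(tree))
```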
04. Naive Bayes
Naive Bayes is based on Bayes’ theorem, which measures the probability of each class given the observed feature values x:

P(class | x) = P(x | class) · P(class) / P(x)

This algorithm is used for classification problems, typically yielding a binary “yes/no” result.
The Naive Bayes classifier is a popular statistical technique used for filtering spam!
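Here is a toy spam filter along those lines, a sketch using scikit-learn's MultinomialNB with an invented four-message corpus; a real filter would need far more data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented corpus for illustration only
messages = ["win a free prize now", "free money click now",
            "meeting at noon tomorrow", "lunch with the team"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)  # word counts as features

clf = MultinomialNB()
clf.fit(X, labels)  # estimates P(word | class) and P(class) from the counts

test = vectorizer.transform(["free prize tomorrow"])
print(clf.predict(test), clf.predict_proba(test))
```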
05. Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised algorithm used for classification problems. To apply it, we plot the data items as points in n-dimensional space, where n is the number of input features. SVM then finds an optimal boundary, called a hyperplane, that best separates the possible outputs by their class labels.
The distance between the hyperplane and the nearest points of each class is called the margin. The optimal hyperplane is the one with the largest margin, i.e., the one that maximizes the distance to the nearest data points of both classes.
For example, H1 does not separate the two classes. But H2 does, although with a small margin. H3 separates them with the largest margin.
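A minimal linear-kernel sketch with scikit-learn's SVC, on two invented 2-D clusters (the data and parameters are our own, chosen for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Two small made-up clusters in 2-D feature space
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel searches for the maximum-margin separating hyperplane
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.support_vectors_)   # the points closest to the boundary
print(clf.predict([[4, 4]]))  # classify a new point
```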
06. K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNN) is very simple. KNN classifies an object by searching the entire training set for the K most similar instances, i.e., the K neighbors, and assigning the object the output variable that is most common among them.
The choice of K is crucial: a small value may let noise dominate and produce inaccurate results, while a large value makes the algorithm computationally expensive. KNN is most commonly used for classification but can also be applied to regression problems.
The distance used to evaluate the similarity between instances can be Euclidean distance, Manhattan distance, or Minkowski distance. Euclidean distance is the straight-line distance between two points. It is actually the square root of the sum of the squares of the differences of the point coordinates.
An example of KNN classification.
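A sketch with scikit-learn's KNeighborsClassifier, using Euclidean distance (the default metric) and made-up points:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented 2-D points belonging to two classes
X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)  # K = 3, Euclidean distance by default
knn.fit(X, y)  # KNN just stores the training set; the work happens at query time

print(knn.predict([[2, 1]]))     # majority vote among the 3 nearest neighbors
print(knn.kneighbors([[2, 1]]))  # their distances and indices
```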
07. K-Means
K-Means is a clustering algorithm that groups a dataset into clusters. For example, this algorithm can be used to group users based on purchase history. It finds K clusters in the dataset. K-Means is used for unsupervised learning, so we only need the training data X and the number of clusters K we want to identify.
The algorithm starts by selecting K points, called centroids, one for each cluster. It then iteratively assigns each data point, based on its features, to the group with the nearest centroid, and recomputes the centroids from the assigned points. This process continues until the centroids stop changing.
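A compact sketch with scikit-learn's KMeans, on unlabeled invented data (note that no target variable is passed to fit):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: K-Means never sees class labels
X = np.array([[1, 1], [1.5, 2], [1, 0], [8, 8], [8, 9], [9, 8.5]])

# Ask for K = 2 clusters; fit alternates assignment and centroid updates
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

print(kmeans.labels_)           # cluster assigned to each point
print(kmeans.cluster_centers_)  # final centroids
```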
08. Random Forest
Random Forest is a very popular ensemble machine learning algorithm. The basic idea of this algorithm is that the opinion of many is more accurate than that of an individual. In Random Forest, we use an ensemble of decision trees (see Decision Trees).
To classify a new object, we vote from each decision tree and combine the results, then make the final decision based on the majority vote.
- During training, each decision tree is constructed from a bootstrap sample of the training set.
- During classification, the decision for an input instance is made by majority vote.
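The whole pipeline is a few lines with scikit-learn's RandomForestClassifier; the dataset and parameters below are our own illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees, each trained on a bootstrap sample of the data;
# predictions are combined across the trees
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

print(forest.predict(X[:3]))        # class chosen by the ensemble
print(forest.predict_proba(X[:3]))  # averaged class probabilities over the trees
```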
09. Dimensionality Reduction
Due to the large amount of data we can capture today, machine learning problems have become more complex. This means training is extremely slow, and it’s hard to find a good solution. This problem is commonly referred to as the “Curse of Dimensionality”.
Dimensionality Reduction attempts to solve this problem by combining specific features into higher-level features without losing the most important information. Principal Component Analysis (PCA) is the most popular dimensionality reduction technique.
PCA reduces the dimensionality of the dataset by compressing it to a lower-dimensional line or hyperplane/subspace. This retains the significant features of the original data as much as possible.
An example of achieving dimensionality reduction by approximating all data points to a line.
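Here is what that projection looks like as a sketch with scikit-learn's PCA, on synthetic 3-D data that mostly varies along one direction:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic 3-D data lying close to a single line, plus a little noise
X = rng.normal(size=(100, 1)) @ np.array([[2.0, 1.0, 0.5]]) \
    + rng.normal(scale=0.1, size=(100, 3))

# Project onto the one direction that preserves the most variance
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 1): three features compressed to one
print(pca.explained_variance_ratio_)  # share of the variance retained
```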
10. Artificial Neural Networks (ANN)
Artificial Neural Networks (ANN) can handle large and complex machine learning tasks. A neural network is essentially a set of interconnected layers of weighted nodes, called neurons. Between the input and output layers we can insert multiple hidden layers; an ANN typically uses up to two hidden layers, and networks with many more hidden layers are the domain of deep learning.
The working principle of artificial neural networks is similar to the structure of the brain. A group of neurons is assigned random weights to determine how the neurons process input data. By training the neural network on the input data, it learns the relationship between input and output. During the training phase, the system has access to the correct answers.
If the network cannot accurately recognize the input, the system will adjust the weights. After sufficient training, it will consistently recognize the correct patterns.
Each circular node represents an artificial neuron, and the arrows indicate the connections from the output of one artificial neuron to the input of another.
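As a final sketch, scikit-learn's MLPClassifier trains a small network with two hidden layers on the bundled digits dataset; the layer sizes and other parameters are arbitrary choices for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers; weights start random and are adjusted during training
# to reduce the error against the known answers
net = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
net.fit(X_train, y_train)

print(net.score(X_test, y_test))  # accuracy on unseen digits
```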
What’s next? Now that you have a basic introduction to the most popular machine learning algorithms, you are ready to learn more complex concepts and even implement them through in-depth hands-on practice.
Happy learning!
Author Introduction:
Fahim ul Haq has worked at Facebook and Microsoft. He is a co-founder of Educative.io, which aims to help students learn programming through interactive courses.
Original link:
https://towardsdatascience.com/the-top-10-ml-algorithms-for-data-science-in-5-minutes-4ffbed9c8672
