Understanding Machine Learning in Simple Terms

This article is reprinted from the public account Datawhale.

Translator: Ahong. Source: dataxon

Machine learning is a hot topic, but aside from those who are well-versed in it, very few can explain what it really is. When reading articles about machine learning online, you are likely to encounter two situations: either a heavy academic trilogy filled with various theorems (I can barely handle half a theorem) or grandiose stories about artificial intelligence, data science magic, and the future of work.


I decided to write a long-awaited article to provide a simple introduction for those who want to understand machine learning. It will not involve advanced principles, but will use simple language to discuss real-world problems and practical solutions. Whether you are a programmer or a manager, you will be able to understand it.

So let’s get started!

Why Do We Want Machines to Learn?

Now we have Billy, who wants to buy a car, and he wants to calculate how much he needs to save each month to afford it. After browsing dozens of ads online, he learns that a new car costs around $20,000, a one-year-old used car costs $19,000, and a two-year-old car costs $18,000, and so on.

As a clever analyst, Billy discovers a pattern: the price of a car depends on its age; for each additional year, the price drops by $1,000, but it won’t go below $10,000.

In machine learning terms, Billy has invented “regression” — predicting a value (price) based on known historical data. When people try to estimate a reasonable price for a used iPhone on eBay or calculate how many ribs to prepare for a barbecue party, they are using a method similar to Billy’s — 200g per person? 500?
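To make Billy’s discovery concrete, here is a minimal sketch that fits a line to his asking prices with NumPy. The prices and the $10,000 floor come straight from the story; everything else is illustrative.

```python
import numpy as np

# Billy's made-up market data: car age in years vs. asking price in dollars.
ages = np.array([0, 1, 2, 3, 4])
prices = np.array([20000, 19000, 18000, 17000, 16000])

# Fit a straight line: price ≈ slope * age + intercept
slope, intercept = np.polyfit(ages, prices, deg=1)
print(f"price ≈ {slope:.0f} * age + {intercept:.0f}")  # -1000 * age + 20000

def predict_price(age):
    # Billy's extra rule: the price never drops below $10,000.
    return max(slope * age + intercept, 10000)

print(predict_price(5))   # 15000.0
print(predict_price(12))  # 10000 (clipped at the floor)
```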

Yes, it would be great if there were a simple formula to solve all the world’s problems — especially for barbecue parties — but unfortunately, that’s not possible.

Let’s return to the car-buying scenario. Now the problem is that, besides age, cars have different manufacturing dates, dozens of accessories, technical conditions, seasonal demand fluctuations… God knows what other hidden factors there are… Ordinary Billy cannot consider all this data when calculating prices, and neither can I.

People are both lazy and foolish, so we need robots to do the math for us. Let’s take the computational approach: give the machine some data and ask it to find all the hidden patterns related to price.

And it works! The most exciting part is that the machine copes with this task much better than a human who carefully analyzes all the dependencies in his head.

And that’s how machine learning was born.

The Three Components of Machine Learning


Setting aside all the nonsense related to artificial intelligence (AI), the only goal of machine learning is to predict outcomes based on input data, that’s it. All machine learning tasks can be represented in this way; otherwise, it wouldn’t be a machine learning problem from the start.

The more diverse the samples, the easier it is to find related patterns and predict outcomes. Therefore, we need three components to train the machine:

Data

Want to detect spam? Get samples of spam messages. Want to predict stocks? Find historical price data. Want to identify user preferences? Analyze their activity records on Facebook (no, Mark, stop collecting data, it’s enough already). The more diverse the data, the better the results. A machine that is expected to work hard needs at least hundreds of thousands of rows of data.

There are two main ways to acquire data: manually or automatically. Manually collected data contains fewer errors but takes more time to gather, which usually makes it more expensive. Automated methods are comparatively cheap: you gather all the data you can find and hope the quality holds up.

Some smart guys like Google use their own users to annotate data for free. Remember reCAPTCHA forcing you to “select all the road signs”? That’s exactly how they acquire data: free labor! Well done. If I were them, I would show those verification images even more often, but wait…


Good datasets are really hard to obtain. They are so important that companies may even open up their algorithms, but they rarely release their datasets.

Features

Also known as “parameters” or “variables,” such as the mileage of a car, user gender, stock prices, word frequency in documents, etc. In other words, these are the factors the machine needs to consider.

If data is stored in a tabular format, features correspond to column names, which is relatively simple. But what if it’s 100GB of cat images? We can’t treat every pixel as a feature. This is why selecting appropriate features often takes more time than other steps in machine learning, and feature selection is also a major source of error. Human subjectivity leads people to choose features they like or feel are “more important” — this should be avoided.

Algorithms

The most obvious part. Any problem can be solved in different ways. The method you choose will affect the final model’s accuracy, performance, and size. One point to note: if the data quality is poor, even the best algorithm will not help. This is known as “garbage in, garbage out” (GIGO). Therefore, before spending a lot of effort on accuracy, one should acquire more data.

Learning vs. Intelligence

I once saw an article on some popular media sites titled “Will Neural Networks Replace Machine Learning?” These media people always inexplicably exaggerate techniques like linear regression as “artificial intelligence,” almost calling it “Skynet.” The following diagram shows the relationship between several easily confused concepts.

[Figure: how artificial intelligence, machine learning, neural networks, and deep learning relate to one another]

  • “Artificial intelligence” is the name of the entire discipline, similar to “biology” or “chemistry.”
  • “Machine learning” is an important component of artificial intelligence, but not the only one.
  • “Neural networks” are one branch of machine learning; a very popular one, but there are other branches in the family.
  • “Deep learning” is a modern method of building, training, and using neural networks. Essentially, it is just a new architecture. In current practice, nobody separates deep learning from “ordinary networks”; the libraries used to call them are the same. To avoid looking like a fool, you’d better name the specific type of network and avoid buzzwords.

The general principle is to compare things at the same level. This is why “neural networks will replace machine learning” sounds like “wheels will replace cars.” Dear media, this will severely damage your reputation.

What machines can do:

  • Predict;
  • Remember;
  • Copy;
  • Select the best option;

What machines cannot do:

  • Create new things;
  • Become smart quickly;
  • Go beyond the task scope;
  • Eliminate all of humanity;

The Landscape of the Machine Learning World

If you are too lazy to read long texts, the following image can help you gain some understanding.

[Figure: a one-image map of the machine learning world]

In the world of machine learning there is never one unique way to solve a problem. Remember this, because you will always find several algorithms that could solve it, and you have to choose the one that fits best. Of course, every problem can be thrown at a neural network, but who will bear the hardware costs of all that computing power?

Let’s start with some basic overviews. Currently, machine learning mainly has four directions.

[Figure: the four main directions of machine learning]

Part 1: Classic Machine Learning Algorithms

Classic machine learning algorithms originated from pure statistics in the 1950s. Statisticians were solving formal math problems such as finding patterns in numbers, estimating distances between data points, and calculating vector directions.

Today, half of the internet runs on these algorithms. When you see a list of articles to “read next,” or your bank blocks your card at some remote gas station, it is most likely the work of one of these little guys.

Large tech companies are staunch advocates of neural networks. The reason is obvious; for these large enterprises, a 2% accuracy improvement means an additional $2 billion in revenue. However, when a company’s business volume is small, it becomes less important. I heard of a team that spent a year developing a new recommendation algorithm for their e-commerce site, only to find later that 99% of the traffic on the site came from search engines — their algorithm was useless since most users didn’t even open the homepage.

Although classic algorithms are widely used, their principles are quite simple, and you can easily explain them to a toddler. They are like basic arithmetic — we use them every day without even thinking.

1.1 Supervised Learning

Classic machine learning is usually divided into two categories: Supervised Learning and Unsupervised Learning.

In “supervised learning,” there is a “supervisor” or “teacher” who provides the machine with all the answers to assist in learning, such as whether an image is of a cat or a dog. The “teacher” has already divided the dataset — labeling it as “cat” or “dog,” and the machine uses these example data to learn to distinguish between cats and dogs.

Unsupervised learning means that the machine has to distinguish who is who from a pile of animal pictures on its own. The data has not been pre-labeled, and there is no “teacher”; the machine must find all possible patterns by itself. This will be discussed later.

It is clear that when a “teacher” is present, the machine learns faster, which is why supervised learning is more commonly used in real life. Supervised learning is divided into two categories:

  • Classification, predicting the category to which an object belongs;
  • Regression, predicting a specific point on a number line;

Classification

“Classifying objects based on a known attribute, such as categorizing socks by color, classifying documents by language, or categorizing music by style.”


Classification algorithms are commonly used for:

  • Filtering spam;
  • Language detection;
  • Finding similar documents;
  • Sentiment analysis;
  • Recognizing handwritten letters or numbers;
  • Fraud detection;

Common algorithms include:

  • Naive Bayes
  • Decision Tree
  • Logistic Regression
  • K-Nearest Neighbors
  • Support Vector Machine

Machine learning mainly solves “classification” problems. This machine is like a baby learning to categorize toys: this is a “robot,” this is a “car,” this is a “machine-car”… Oops, wait, wrong! Wrong!

In classification tasks, you need a “teacher.” The data needs to be pre-labeled so that the machine can learn to classify based on these labels. Everything can be classified — classifying users based on interests, classifying articles based on language and theme (which is crucial for search engines), classifying music based on type (Spotify playlists), and your emails are no exception.

The Naive Bayes algorithm is widely used for spam filtering. The machine counts how often words like “Viagra” appear in spam versus normal mail, multiplies the probabilities together using Bayes’ theorem, sums the results, and ha, the machine has completed its learning.
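Below is a toy sketch of that word-counting idea in Python. The six “messages” and the add-one smoothing are invented for illustration; a real filter would train on thousands of emails.

```python
from collections import Counter

# Tiny invented training set: three spam and three normal ("ham") messages.
spam = ["buy viagra now", "cheap viagra deal", "win money now"]
ham = ["meeting at noon", "project deadline tomorrow", "lunch at noon"]

def word_counts(messages):
    return Counter(word for msg in messages for word in msg.split())

spam_counts, ham_counts = word_counts(spam), word_counts(ham)
vocab = set(spam_counts) | set(ham_counts)

def spam_probability(message):
    # Bayes: P(spam | words) is proportional to P(spam) * product of P(word | spam).
    # Add-one smoothing keeps unseen words from zeroing out the product.
    p_spam = p_ham = 0.5  # equal priors: half of our training mail is spam
    for word in message.split():
        p_spam *= (spam_counts[word] + 1) / (sum(spam_counts.values()) + len(vocab))
        p_ham *= (ham_counts[word] + 1) / (sum(ham_counts.values()) + len(vocab))
    return p_spam / (p_spam + p_ham)

print(spam_probability("viagra deal now"))  # close to 1 -> looks like spam
print(spam_probability("meeting at noon"))  # close to 0 -> looks legitimate
```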


Later, spammers learned how to cope with Bayesian filters by stuffing lots of “good” words at the end of the email; the method is ironically referred to as “Bayesian poisoning.” Naive Bayes went down in history as the most elegant, and the first practically useful, spam-fighting algorithm, although other algorithms handle spam filtering now.

Here is another example of classification at work. Suppose you need to borrow some money. How does the bank know whether you will repay it? It cannot know for sure. But the bank has many historical profiles of borrowers, with data such as “age,” “education level,” “occupation,” “salary,” and, most importantly, “whether they repaid.”

Using this data, we can train the machine to find patterns and draw conclusions. Finding the answer is not the problem; the issue is that banks cannot blindly trust the answers given by the machine. What if the system malfunctions, gets hacked, or a drunken graduate just patched the system in an emergency?

To handle this issue, we use decision trees: the machine automatically splits the data by asking “yes/no” questions, for example, “Does the borrower’s income exceed $128.12?” Sounds a bit inhumane, but the machine generates such questions so that the data is partitioned optimally at each step.
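Here is a hedged sketch of that idea with scikit-learn’s DecisionTreeClassifier. The six borrower profiles are invented; export_text prints the yes/no questions the tree settled on.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented borrower profiles: [age, income in $1000s]; 1 = repaid, 0 = defaulted.
X = [[25, 30], [40, 80], [35, 120], [22, 20], [50, 95], [30, 40]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))  # the yes/no splits
print(tree.predict([[28, 90]]))  # prediction for a new applicant
```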


That’s how the “tree” is formed. The higher the branch (the closer to the root node), the broader the question it asks. Any analyst can accept this approach and explain the result afterward, even without understanding how the algorithm works (typical analysts!).

Decision trees are widely used in high-stakes scenarios: diagnosis, medicine, and finance.

The two most well-known decision tree algorithms are CART and C4.5.

Nowadays, pure decision tree algorithms are rarely used. However, they are the foundation of large systems, and the performance of decision tree ensembles can even surpass that of neural networks. We will discuss this later.

When you search on Google, it is a bunch of clumsy “trees” helping you find answers. Search engines like these algorithms because they run fast.

In theory, Support Vector Machine (SVM) should be the most popular classification method. Anything that exists can be classified using it: classifying plants by shape in images, classifying documents by category, etc.

The idea behind SVM is simple — it tries to draw two lines between data points and maximize the distance between the two lines as much as possible. As illustrated below:

[Figure: an SVM drawing the widest possible gap between two classes]
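A minimal sketch of that margin idea with scikit-learn; the two little clusters of points below are invented.

```python
from sklearn.svm import SVC

# Two invented clusters of 2-D points and their class labels.
X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 8], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)           # the points that pin down the widest margin
print(clf.predict([[3, 3], [7, 7]]))  # -> [0 1]
```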

Classification algorithms have one very useful niche: anomaly detection. When an object does not fit any of the known categories, we flag it. This method is now used in medicine: in MRI (magnetic resonance imaging), computers flag all suspicious areas or deviations within the scanned range. Stock markets use it to detect traders’ anomalous behavior and catch insiders. When we train a computer to recognize what is correct, we automatically teach it to recognize what is incorrect.

The rule of thumb indicates that the more complex the data, the more complex the algorithm. For text, numbers, and tables, I would choose classic methods to operate. These models are smaller, learn faster, and have clearer workflows. For images, videos, and other complex big data, I would definitely study neural networks.

Just five years ago, you could still find SVM-based face classifiers. Now, it is easier to pick one from hundreds of pre-trained neural network models. However, spam filters have not changed; they are still written using SVM, and there is no reason to change that. Even my website uses SVM to filter spam in comments.

Regression

“Draw a line through these points, hmm~ that’s machine learning”


Regression algorithms are currently used for:

  • Stock price prediction;
  • Supply and sales volume analysis;
  • Medical diagnosis;
  • Calculating time series correlations;

Common regression algorithms include:

  • Linear Regression
  • Polynomial Regression

The “regression” algorithm is essentially also a “classification” algorithm; it just predicts a value instead of a category. For example, predicting the price of a car based on mileage, estimating traffic volume at different times of the day, and predicting changes in supply volume with company growth, etc. When dealing with time-related tasks, regression algorithms are the best choice.

Regression is favored by people in finance and analytics; it has even become a built-in feature in Excel, and the whole process goes smoothly: the machine simply tries to draw a line that represents the average correlation. Unlike a person with a pen and a whiteboard, though, the machine does it with mathematical precision, calculating the average distance of each point from the line.


If the line drawn is straight, it’s “linear regression”; if the line is curved, it’s “polynomial regression.” These are the two main types of regression; the others are relatively rare. Don’t be fooled by the bad apple in the family: logistic regression, despite the name, is a classification algorithm, not regression.
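Here is a sketch of both flavors using numpy.polyfit on synthetic data (a curved trend plus noise, invented for illustration): degree 1 gives the straight line, degree 2 the curve.

```python
import numpy as np

# Synthetic data: a quadratic trend plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 0.5 * x**2 - 2 * x + 3 + rng.normal(0, 2, size=x.shape)

linear = np.polyfit(x, y, deg=1)  # straight line: underfits the curve
poly = np.polyfit(x, y, deg=2)    # parabola: follows the curved trend

print(np.polyval(linear, 5.0))  # linear prediction at x = 5
print(np.polyval(poly, 5.0))    # polynomial prediction at x = 5 (closer to truth)
```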

However, it’s fine to mix up “regression” and “classification.” Some classifiers turn into regressors after you tune their parameters. Besides determining the class of an object, we can also remember how close it is to that class, and out comes a regression problem. If you want to dig deeper, read the article “Machine Learning for Humans” [1] (highly recommended).

1.2 Unsupervised Learning

Unsupervised learning appeared a bit later than supervised learning, in the 1990s. These algorithms are used less often, sometimes simply because there is no alternative.

Having labeled data is a luxury. Suppose I want to create a — let’s say “bus classifier,” do I have to personally take millions of photos of those damn buses on the street and label each one? No way, that would take a lifetime, and I still have many games on Steam to play.

In cases like this, capitalism offers a glimmer of hope: thanks to crowdsourcing, we can get millions of cheap workers and services. Mechanical Turk [2], for example, is backed by a crowd of people ready to complete your task for a $0.05 reward. That’s usually how things get done.

Or, you can try using unsupervised learning. But I don’t remember any best practices about it. Unsupervised learning is usually used for exploratory data analysis rather than as a primary algorithm. Those with degrees from Oxford and special training feed the machine a large pile of garbage and then start observing: are there any clusters? No. Can we see any connections? No. Well, next, you still want to work in data science, right?

Clustering

“The machine will choose the best way to distinguish things based on some unknown features.”


Clustering algorithms are currently used for:

  • Market segmentation (customer types, loyalty);
  • Merging neighboring points on a map;
  • Image compression;
  • Analyzing and labeling new data;
  • Detecting anomalous behavior;

Common algorithms:

  • K-means Clustering
  • Mean-Shift
  • DBSCAN

Clustering is classification without pre-labeled categories. It’s like sorting socks by color when you can’t remember all the colors you own. Clustering algorithms try to find similar objects (based on certain features) and group them into clusters. Objects with many similar features end up in the same cluster. Some algorithms even let you specify the exact number of clusters you want.

Here’s a good example of clustering — markers on online maps. When you look for vegan restaurants nearby, the clustering engine groups them and displays them with numbered bubbles. If it didn’t do this, the browser would freeze — because it would try to plot all 300 vegan restaurants in this trendy city on the map.

Apple Photos and Google Photos use more complex clustering methods. They create albums of your friends by searching for faces in the photos. The application doesn’t know how many friends you have or what they look like, but it can still find common facial features. This is typical clustering.

Another common application is image compression. When saving an image to PNG, you can limit the palette to, say, 32 colors. The clustering algorithm then has to find all the “reddish” pixels, compute the “average red,” and assign it to all of them. Fewer colors mean a smaller file; a good deal!

However, it can get tricky with colors like blue-green. Is it green or blue? This is where the K-means algorithm comes into play.

First, randomly select 32 color points as “cluster centers” from the colors, and label the remaining points according to the nearest cluster center. This way, we get “star clusters” around the 32 color points. Then we move the cluster centers to the “centers of the stars” and repeat the above steps until the cluster centers stop moving.

Done. Exactly 32 stable clusters have been formed.
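A sketch of that 32-color quantization with scikit-learn’s KMeans. Random numbers stand in for pixels here; a real image would be loaded and reshaped to the same (n_pixels, 3) layout.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for an image: 10,000 random RGB pixels.
pixels = np.random.randint(0, 256, size=(10000, 3))

kmeans = KMeans(n_clusters=32, n_init=10, random_state=0).fit(pixels)
palette = kmeans.cluster_centers_.astype(np.uint8)  # the 32 "average" colors
quantized = palette[kmeans.labels_]                 # each pixel -> nearest center

print(palette.shape, quantized.shape)  # (32, 3) (10000, 3)
```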


Finding cluster centers this way is convenient, but real-world clusters are not always circular. Suppose you are a geologist and need to find similar ores on a map. In this case, the shape of the clusters may be strange or even nested. You might not even know how many clusters there are, 10? 100?

The K-means algorithm might not work here, but the DBSCAN algorithm will. Treat the data points as people in a square and ask any three people standing close together to hold hands. Then tell them to grab the hands of any neighbors they can reach (nobody may move from their spot), and repeat until no new neighbors join. That gives us the first cluster; repeat the process until everyone is assigned, and we’re done. A bonus: anyone left holding no one’s hand is an anomalous data point.
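A minimal sketch of this with scikit-learn’s DBSCAN on an invented “crowd”: eps is roughly the reach of one hand-hold, min_samples is the three people needed to start a cluster, and label -1 marks the loner.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# An invented "crowd": two dense groups and one loner standing far away.
rng = np.random.default_rng(1)
crowd = np.vstack([
    rng.normal([0, 0], 0.3, size=(30, 2)),
    rng.normal([5, 5], 0.3, size=(30, 2)),
    [[10.0, 0.0]],
])

labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(crowd)
print(set(labels))  # two cluster ids plus -1: the anomalous loner
```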

The whole process looks cool.


If you are interested in learning more about clustering algorithms, read the article “5 Clustering Algorithms Every Data Scientist Should Know” [3].

Like classification algorithms, clustering can be used to detect anomalies. If a user performs abnormal operations after logging in, let the machine temporarily disable their account, then create a ticket for technical support to check what’s going on. Maybe the other person is a “robot.” We don’t even have to know what “normal behavior” looks like; just feed the user’s behavior data to the model and let the machine decide whether the other party is a “typical” user.

This method may not be as effective as classification algorithms, but it’s still worth a try.

Dimensionality Reduction

“Assembling specific features into higher-level features”


“Dimensionality Reduction” algorithms are currently used for:

  • Recommendation systems;
  • Beautiful visualizations;
  • Topic modeling and finding similar documents;
  • Fake image recognition;
  • Risk management;

Common “dimensionality reduction” algorithms:

  • Principal Component Analysis (PCA)
  • Singular Value Decomposition (SVD)
  • Latent Dirichlet Allocation (LDA)
  • Latent Semantic Analysis (LSA, pLSA, GLSA)
  • t-SNE (for visualization)

Years ago, “hardcore” data scientists would use these methods, determined to discover “interesting things” in a large pile of numbers. When Excel charts didn’t work, they forced machines to do the pattern-finding job. Thus, they invented dimensionality reduction or feature learning methods.

[Figure: projecting 2-D data onto a line (PCA)]

For people, abstract concepts are more convenient than a large pile of fragmented features. For example, we can combine dogs with triangular ears, long noses, and big tails into the abstract concept of “shepherd dogs.” Compared to a specific shepherd dog, we do lose some information, but the new abstract concept is more useful in scenarios that require naming and explanation. As a reward, these “abstract” models learn faster, use fewer features during training, and reduce overfitting.
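Here is a minimal sketch of that compression with scikit-learn’s PCA on invented data: two strongly correlated measurements collapse into one abstract feature with little information lost.

```python
import numpy as np
from sklearn.decomposition import PCA

# Invented data: two strongly correlated features.
rng = np.random.default_rng(2)
height = rng.normal(50, 5, size=100)
weight = 0.9 * height + rng.normal(0, 2, size=100)
X = np.column_stack([height, weight])

pca = PCA(n_components=1)
compressed = pca.fit_transform(X)     # 2 features -> 1 "abstract" feature
print(pca.explained_variance_ratio_)  # close to 1.0: little information is lost
```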

These algorithms really shine at “topic modeling”: abstracting meaning out of specific words. That is what Latent Semantic Analysis (LSA) was designed for; it is based on how frequently a word appears within a given topic. Technology-related vocabulary is likely to appear more often in technology articles, politicians’ names mostly show up in political news, and so on.

We can create clusters directly from all words in all articles, but doing so would lose all important connections (for example, the meanings of battery and accumulator are the same in different articles), and LSA can handle this problem well, which is why it’s called “latent semantic.”

Therefore, we need to connect words and documents into a single feature that preserves these latent links. It turns out that Singular Value Decomposition (SVD) solves exactly this problem, and the useful topic clusters are easy to see from the groups of words that occur together.
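A hedged sketch of LSA with scikit-learn: TF-IDF word counts factored by truncated SVD. The six tiny “documents” are invented so that a tech topic and a politics topic fall out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Six invented one-line "documents": three tech, three politics.
docs = [
    "new smartphone chip benchmark", "gpu chip performance review",
    "smartphone battery benchmark", "senator wins election vote",
    "parliament election debate", "senator debate on vote reform",
]

tfidf = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)  # two latent "topics"
topic_space = lsa.fit_transform(tfidf)

print(topic_space.round(2))  # tech and politics documents separate along the axes
```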


Recommendation systems and collaborative filtering are another high-frequency area using dimensionality reduction algorithms. If you extract information from user ratings, you will get a great system for recommending movies, music, games, or anything else you want.
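In the same spirit, here is a rough sketch of collaborative filtering: a low-rank SVD of a tiny user-by-movie ratings matrix (all numbers invented, 0 meaning “not rated yet”) fills in the gaps as predicted ratings.

```python
import numpy as np

# Invented user x movie ratings; 0 = not rated yet.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 2  # keep two latent "taste" factors
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(approx.round(1))  # the former zeros now hold predicted ratings
```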

Here’s a book I love: “Programming Collective Intelligence”, which was my bedside book during college.

It is almost impossible to fully grasp the abstractions a machine builds, but you can watch for correlations: some abstract concepts track user age (children play “Minecraft” or watch cartoons more), while others relate to movie styles or user preferences.

Based solely on information like user ratings, machines can find these high-level concepts without even needing to understand them. Well done, Mr. Computer. Now we can write a paper on “Why Bearded Lumberjacks Love My Little Pony.”

Association Rule Learning

“Finding patterns in order streams”


“Association rules” are currently used for:

  • Predicting sales and discounts;
  • Analyzing products purchased together;
  • Planning product displays;
  • Analyzing web browsing patterns;

Common algorithms:

  • Apriori
  • Eclat
  • FP-growth

Algorithms used to analyze shopping carts, automate marketing strategies, and other event-related tasks are here. If you want to discover some patterns from a sequence of items, give them a try.

For example, a customer takes a six-pack of beer to the checkout. Should we place peanuts along the checkout path? How often do people buy beer and peanuts together? Yes, association rules are likely applicable to the beer and peanuts scenario, but what other sequences can we predict? Can small changes in product layout lead to significant profit increases?
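A toy sketch of that counting in plain Python: how often do beer and peanuts land in the same basket? The five baskets are invented; real systems run Apriori or FP-growth over millions of receipts.

```python
from collections import Counter
from itertools import combinations

# Five invented shopping baskets.
baskets = [
    {"beer", "peanuts", "chips"}, {"beer", "peanuts"},
    {"bread", "milk"}, {"beer", "chips"}, {"peanuts", "cola"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

support = pair_counts[("beer", "peanuts")] / len(baskets)
print(f"beer + peanuts appear together in {support:.0%} of baskets")  # 40%
```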

This idea also applies to e-commerce, where the tasks are even more interesting — What will the customer buy next?

I don’t know why rule learning is so rarely discussed within machine learning. The classic approach is a brute-force check over all the purchased goods, using trees or sets. Algorithms can only search for patterns; they cannot generalize or reproduce those patterns on new examples.

In the real world, every large retailer has established its own exclusive solutions, so this will not bring you any revolutions. The most advanced technology mentioned here is the recommendation system. However, I might not be aware of any breakthroughs in this area. If you have anything to share, please let me know in the comments.
