
I decided to write a long-awaited article to provide a simple introduction to those who want to understand machine learning. We won’t discuss advanced principles; instead, we’ll use simple language to talk about real-world problems and practical solutions. Whether you are a programmer or a manager, you will be able to understand.
So let’s get started!
Why Do We Want Machines to Learn?
Here comes Billy, who wants to buy a car and needs to figure out how much he should save each month. After browsing dozens of ads online, he learns that a new car costs around $20,000, a one-year-old used car costs $19,000, and a two-year-old car costs $18,000, and so on.
As a smart analyst, Billy discovers a pattern: the price of a car depends on its age; for every year older, the price drops by $1,000, but it won’t go below $10,000.
In machine learning terms, Billy has invented “regression”—predicting a value (price) based on known historical data. When people try to estimate a reasonable price for a used iPhone on eBay or calculate how many ribs to prepare for a barbecue party, they are using a similar method to Billy’s—200g per person? 500?
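Billy's pattern can be written as a one-line model. A toy sketch in Python (the numbers come from the ads he browsed; the $10,000 floor is his own observation):

```python
def estimate_price(age_years: int) -> int:
    # Billy's rule of thumb: a new car costs about $20,000, each year of
    # age knocks $1,000 off, and the price never drops below $10,000.
    return max(20_000 - 1_000 * age_years, 10_000)

print(estimate_price(1))   # 19000, matching the one-year-old cars he saw
print(estimate_price(15))  # 10000, the floor
```

A real regression would learn the $1,000-per-year slope from the data instead of hard-coding it.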
Yes, it would be great if there were a simple formula to solve all the world’s problems—especially for barbecue parties—but unfortunately, that is not possible.
Let’s return to the car-buying scenario. Now the question is, besides age, there are different production dates, dozens of accessories, technical conditions, seasonal demand fluctuations… who knows what other hidden factors there are… Ordinary people like Billy cannot consider all these data when calculating the price, and neither can I.
People are lazy and foolish—we need robots to help them do the math. Therefore, we use a computer method—provide the machine with some data and let it find all the potential patterns related to the price.
Finally, it works! The most exciting part is that compared to a human carefully analyzing all the dependent factors in their mind, machines can handle it much better.
And thus, machine learning was born.
The Three Components of Machine Learning

Setting aside all the nonsense related to artificial intelligence (AI), the only goal of machine learning is to predict outcomes based on input data, that’s it. All machine learning tasks can be represented this way; otherwise, it is not a machine learning problem from the start.
The more diverse the samples, the easier it is to find associated patterns and predict outcomes. Therefore, we need three components to train the machine:
Data
Want to detect spam? Get samples of spam messages. Want to predict stocks? Find historical price data. Want to identify user preferences? Analyze their activity records on Facebook (no, Mark, stop collecting data, it's enough). The more diverse the data, the better the result, and for the machine to learn anything useful you need at least hundreds of thousands of rows.
There are two main ways to acquire data—manually or automatically. Manually collected data has fewer mixed errors but takes more time—usually costs more too. Automated methods are relatively cheap; you can collect all the data you can find (hopefully with good quality).
Some smart guys like Google use their users to label data for free. Remember ReCaptcha (human verification) forcing you to “select all the road signs”? That’s how they get data, still free labor! Well done. If I were them, I would show those verification images more often, but wait…

Good datasets are really hard to obtain; they are so important that some companies might even open their algorithms but rarely disclose datasets.
Features
Also known as “parameters” or “variables,” such as the mileage of a car, user gender, stock prices, word frequency in a document, etc. In other words, these are the factors that the machine needs to consider.
If the data is stored in a tabular format, features correspond to column names, which is relatively simple. But what if it’s 100GB of cat images? We can’t treat every pixel as a feature. This is why selecting appropriate features often takes more time than other steps in machine learning, and feature selection is also a major source of error. Human subjectivity leads people to choose features they like or feel are “more important”—this should be avoided.
Algorithms
The most obvious part. Any problem can be solved in different ways. The method you choose will affect the final model’s accuracy, performance, and size. One thing to note: if the data quality is poor, even the best algorithm won’t help. This is known as “garbage in, garbage out” (GIGO). Therefore, before spending a lot of effort on accuracy, more data should be acquired.
Learning vs. Intelligence
I once saw an article titled “Will Neural Networks Replace Machine Learning?” on some popular media sites. These journalists always inexplicably exaggerate technologies like linear regression as “artificial intelligence,” almost calling it “Skynet.” The following image shows the relationships between several easily confused concepts.

- "Artificial Intelligence" is the name of the entire discipline, similar to "biology" or "chemistry."
- "Machine Learning" is an important part of artificial intelligence, but not the only part.
- "Neural Networks" are one branch of machine learning; a popular one, but there are other branches under the machine learning umbrella.
- "Deep Learning" is a modern method of building, training, and using neural networks; essentially, it is a new architecture. In current practice, nobody distinguishes deep learning from "regular" neural networks, and the libraries they call are the same. To avoid looking foolish, it's best to name the specific type of network and avoid buzzwords.
The general principle is to compare things at the same level. That’s why saying “neural networks will replace machine learning” sounds like “wheels will replace cars.” Dear media, this will significantly damage your reputation.
| What Machines Can Do | What Machines Cannot Do |
|---|---|
| Predict | Create new things |
| Remember | Become smart quickly |
| Copy | Go beyond the task |
| Select the best option | Eliminate humanity |
The Landscape of Machine Learning
If you are too lazy to read long passages, the image below offers a quick overview.

In the world of machine learning, there is never a single way to solve a problem—remember this—because you will always find several algorithms that can be used to solve a particular problem, and you need to choose the one that fits best. Of course, all problems can be handled with “neural networks,” but who will bear the hardware costs behind the computation?
Let’s start with some basic overviews. Currently, machine learning mainly has four directions.

Part 1: Classic Machine Learning Algorithms
Classic machine learning algorithms originated from pure statistics in the 1950s. Statisticians addressed formal math problems such as finding patterns in numbers, estimating distances between data points, and calculating vector directions.
Today, half of the internet studies these algorithms. When you see a column of “continue reading” articles, or when you find your bank card locked at some remote gas station, it’s likely the work of one of these little guys.
Large tech companies are staunch advocates of neural networks. The reason is obvious; for these large enterprises, a 2% increase in accuracy means an additional $2 billion in revenue. However, when a company’s business scale is small, it becomes less significant. I heard of a team that spent a year developing a new recommendation algorithm for their e-commerce site, only to find that 99% of their traffic came from search engines—the algorithm they developed was useless since most users didn’t even open the homepage.
Although classic algorithms are widely used, their principles are straightforward, and you can easily explain them to a toddler. They are like basic arithmetic—we use them every day without even thinking.
1.1 Supervised Learning
Classic machine learning is usually divided into two categories: supervised learning and unsupervised learning.
In “supervised learning,” there is a “supervisor” or “teacher” who provides the machine with all the answers to assist in learning, such as whether an image is of a cat or a dog. The “teacher” has already labeled the dataset—marking “cat” or “dog”—and the machine uses these sample data to learn to distinguish between cats and dogs.
Unsupervised learning means the machine has to distinguish who is who among a pile of animal images on its own. The data is not pre-labeled, and there is no “teacher,” so the machine must find all possible patterns by itself. This will be discussed later.
Clearly, machines learn faster when there is a “teacher” present, which is why supervised learning is more commonly used in real life. Supervised learning is divided into two categories:
- Classification: predicting the category an object belongs to;
- Regression: predicting a specific point on a number line.
Classification
“Classifying objects based on a known attribute, such as categorizing socks by color, classifying documents by language, or dividing music by style.”

Classification algorithms are commonly used for:
- Filtering spam
- Language detection
- Finding similar documents
- Sentiment analysis
- Recognizing handwritten letters and numbers
- Fraud detection
Common algorithms include:
- Naive Bayes
- Decision Tree
- Logistic Regression
- K-Nearest Neighbors
- Support Vector Machine (SVM)
Machine learning primarily addresses “classification” problems. This machine is like a toddler learning to categorize toys: this is a “robot,” this is a “car,” this is a “machine-car”… uh, wait, wrong! Wrong!
In classification tasks, you need a “teacher.” The data must be pre-labeled so that the machine can learn to classify based on these labels. Everything can be classified—users can be classified based on interests, articles can be classified by language and topic (which is crucial for search engines), and music can be classified by type (Spotify playlists), and your emails are no exception.
The Naive Bayes algorithm is widely used for spam filtering. The machine counts how often words like "Viagra" appear in spam and in normal email, then combines those frequencies using Bayes' theorem to estimate the probability that a new message is spam. Just like that, the machine has learned!

Later, spammers learned how to counter Bayesian filters by adding many "good" words at the end of the email, a trick sarcastically known as "Bayesian poisoning." Naive Bayes went down in history as the most elegant, and first practically useful, spam-filtering algorithm, though other algorithms handle spam filtering now.
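To make the word-counting idea concrete, here is a minimal Naive-Bayes-style filter in pure Python. The six messages are invented examples, the two classes are assumed equally likely, and Laplace smoothing is added so unseen words do not zero out the product:

```python
import math
from collections import Counter

# Invented toy corpus; equal priors assumed for spam and ham.
spam = ["win money now", "viagra cheap viagra", "win a prize now"]
ham = ["meeting at noon", "project report attached", "lunch at noon"]

def word_counts(messages):
    counts = Counter()
    for m in messages:
        counts.update(m.lower().split())
    return counts

spam_counts, ham_counts = word_counts(spam), word_counts(ham)
spam_total, ham_total = sum(spam_counts.values()), sum(ham_counts.values())
vocab = set(spam_counts) | set(ham_counts)

def log_prob(message, counts, total):
    # Laplace smoothing: unseen words get a small, nonzero probability.
    return sum(math.log((counts[w] + 1) / (total + len(vocab)))
               for w in message.lower().split())

def is_spam(message):
    return (log_prob(message, spam_counts, spam_total)
            > log_prob(message, ham_counts, ham_total))

print(is_spam("win viagra now"))   # True
print(is_spam("report at noon"))   # False
```

Real filters work the same way, only with millions of messages and extra tricks against exactly the "Bayesian poisoning" described above.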
Another example of a classification algorithm. Suppose you need to borrow money; how does the bank know whether you will pay it back in the future? They can’t be sure. However, the bank has many historical borrower profiles, with data such as “age,” “education level,” “occupation,” “salary,” and—most importantly—“whether they paid back.”
Using this data, we can train the machine to find patterns and derive answers. Finding the answer is not the issue; the problem is that banks cannot blindly trust the answers given by the machine. What if the system fails, gets hacked, or a drunken graduate just patched the system in an emergency?
To address this problem, we use decision trees, which automatically split the data into yes/no questions, such as "Does the borrower earn more than $128.12?" That threshold sounds a bit inhuman, but the machine generates whichever questions partition the data best at each step.

This is how the “tree” is formed. The higher the branches (closer to the root node), the broader the scope of the questions. All analysts can accept this approach and provide explanations afterward, even if they don’t understand how the algorithm works; they can still easily explain the results (typical of analysts)!
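The bank example above, written as a hand-built toy tree of nested yes/no questions. The features and thresholds here are made up for illustration; a real learner such as CART would find the questions and split points from the bank's historical data:

```python
# Made-up thresholds for illustration only; a real tree learner (e.g. CART)
# chooses the questions that best separate repaid loans from defaults.
def approve_loan(income: float, repaid_before: bool, years_employed: int) -> bool:
    if income > 50_000:          # root question: the broadest split
        return True
    if repaid_before:            # next split: repayment history
        return years_employed >= 2
    return False

print(approve_loan(60_000, False, 0))  # True
print(approve_loan(30_000, True, 5))   # True
print(approve_loan(30_000, False, 5))  # False
```

This also shows why analysts like trees: every prediction can be read off as a chain of plain-language questions.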
Decision trees are widely used in high-stakes scenarios: diagnostics, medicine, and finance.
The two most well-known decision tree algorithms are CART and C4.5.
Nowadays, pure decision tree algorithms are rarely used. However, they are the foundation of large systems, and the effects of decision tree ensembles can be even better than neural networks. We will discuss this later.
When you search on Google, it is a bunch of clumsy “trees” helping you find answers. Search engines like these algorithms because they run fast.
In theory, Support Vector Machines (SVM) should be the most popular classification method. Anything that exists can be classified using it: classifying plants by shape in images, classifying documents by category, etc.
The idea behind SVM is simple—it attempts to draw two lines between data points and maximize the distance between the two lines as much as possible. As illustrated below:

Classification algorithms have a very useful scenario—anomaly detection. If a certain feature cannot be assigned to any category, we mark it. This method is now used in the medical field—MRI (Magnetic Resonance Imaging), where the computer marks all suspicious areas or deviations within the detection range. The stock market uses it to detect traders’ abnormal behaviors to find insiders. While training the computer to identify what things are correct, we also automatically teach it to recognize what things are incorrect.
The rule of thumb indicates that the more complex the data, the more complex the algorithm. For text, numbers, and tables, I would choose classic methods to operate. These models are smaller, learn faster, and have clearer workflows. For images, videos, and other complex big data, I would definitely study neural networks.
Just five years ago, you could still find SVM-based face classifiers. Now, it’s easier to pick one model from hundreds of pre-trained neural network models. However, spam filters haven’t changed; they are still written with SVM, and there’s no reason to change it. Even my website uses SVM to filter spam in comments.
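The "maximize the gap" idea can be shown with a brute-force toy: score candidate separating lines by their worst-case distance to any point, and keep the best. The points and candidate lines are invented; a real SVM solves this optimization exactly with quadratic programming instead of trying candidates:

```python
import math

# Two invented point clouds in 2-D.
pos = [(2, 2), (3, 3), (3, 2)]
neg = [(0, 0), (0, 1), (1, 0)]

def margin(w, b):
    # Smallest signed distance from any point to the line w.x + b = 0;
    # negative if the line misclassifies a point.
    norm = math.hypot(*w)
    dists = [(w[0] * x + w[1] * y + b) / norm for x, y in pos]
    dists += [-(w[0] * x + w[1] * y + b) / norm for x, y in neg]
    return min(dists)

# Brute-force over a handful of parallel candidate lines x + y + b = 0.
candidates = [((1, 1), -b / 2) for b in range(1, 12)]
best_w, best_b = max(candidates, key=lambda c: margin(*c))
print(best_b, round(margin(best_w, best_b), 3))   # -2.5 1.061
```

The winning line sits exactly halfway between the two clouds, which is the whole point of the maximum-margin idea.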
Regression
“Draw a line through these points, hmm~ this is machine learning”

Regression algorithms are currently used for:
- Stock price prediction
- Supply and sales volume analysis
- Medical diagnosis
- Calculating time series correlations

Common algorithms include:

- Linear Regression
- Polynomial Regression
The “regression” algorithm is essentially a “classification” algorithm, except it predicts a value rather than a category. For example, predicting a car’s price based on mileage, estimating traffic volume at different times of the day, and predicting the degree of change in supply with company growth, etc. Regression algorithms are the best choice for handling time-related tasks.
Regression is favored by people in finance and analytics; it has even become a built-in feature in Excel, making the whole process very smooth: the machine simply tries to draw a line that represents the average correlation. Unlike a person with a pen and a whiteboard, though, the machine does this with mathematical precision, calculating the average distance from each point to the line.

If the line drawn is straight, it is “linear regression”; if the line is curved, it is “polynomial regression.” These are the two main types of regression. Other types are relatively rare. Don’t be fooled by the “Logistic regression” misnomer; it’s a classification algorithm, not a regression.
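Fitting the straight line is just a few lines of arithmetic. Here is ordinary least squares computed by hand, on toy numbers echoing Billy's car example from the start of the article:

```python
# Toy data echoing Billy's ads: price drops $1,000 per year of age.
ages = [0, 1, 2, 3, 4]
prices = [20_000, 19_000, 18_000, 17_000, 16_000]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(prices) / n

# Ordinary least squares: slope = cov(x, y) / var(x), intercept from the means.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, prices))
         / sum((x - mean_x) ** 2 for x in ages))
intercept = mean_y - slope * mean_x

print(slope, intercept)        # -1000.0 20000.0
print(slope * 6 + intercept)   # 14000.0, predicted price of a 6-year-old car
```

On this perfectly clean data the machine recovers Billy's rule exactly; on real, noisy data the line would only approximate it.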
However, it's fine to mix up "regression" and "classification." Some classifiers can be tuned into regressors: in addition to naming an object's category, we can remember how close the object is to that category, and that gives rise to a regression problem. If you want to delve deeper, read the article "Machine Learning for Humans"[1] (highly recommended).
1.2 Unsupervised Learning
Unsupervised learning appeared slightly later than supervised learning—in the 1990s, these algorithms were used relatively less, sometimes simply because there were no alternatives.
Having labeled data is a luxury. Suppose I want to create a—say, “bus classifier,” do I need to personally take millions of photos of those damn buses on the street and label each of these images one by one? No way, this would take my entire life, and I still have many games on Steam to play.
In this case, we still need to hold some hope for capitalism; thanks to the social crowdsourcing mechanism, we can obtain millions of cheap labor and services. For example, Mechanical Turk[2], where a group of people is always ready to help you complete tasks for a reward of $0.05. That’s usually how things get done.
Alternatively, you can try using unsupervised learning. But I don’t remember any best practices regarding it. Unsupervised learning is typically used for exploratory data analysis rather than as a primary algorithm. Those with degrees from Oxford and special training feed a bunch of garbage to the machine and then start observing: are there any clusters? No. Can we see some connections? No. Well, next, you still want to work in data science, right?
Clustering
“The machine will choose the best way to distinguish things based on some unknown features.”

Clustering algorithms are currently used for:
- Market segmentation (customer types, loyalty)
- Merging adjacent points on a map
- Image compression
- Analyzing and labeling new data
- Detecting abnormal behavior

Common algorithms include:

- K-Means
- Mean-Shift
- DBSCAN
Clustering is done without pre-labeled categories. It’s like classifying socks even when you can’t remember all their colors. Clustering algorithms try to find similar things (based on certain features) and group them into clusters. Objects with many similar features gather together and are assigned to the same category. Some algorithms even support setting the exact number of data points in each cluster.
Here’s a good example of clustering—markers on online maps. When you look for nearby vegetarian restaurants, the clustering engine groups them and displays them with numbered bubbles. If it didn’t do that, the browser would freeze—because it would try to plot all 300 vegetarian restaurants in this trendy city on the map.
Apple Photos and Google Photos use more complex clustering methods to create albums of your friends by searching for faces in photos. The application does not know how many friends you have or what they look like, but it can still find common facial features. This is typical clustering.
Another common application is image compression. When saving an image as PNG, you can restrict the palette to, say, 32 colors. The clustering algorithm then has to find all the "reddish" pixels, compute the "average red," and assign that average to every red pixel. Fewer colors, smaller file. What a deal!
However, it gets tricky with in-between colors. Is a pixel halfway between green and blue "green" or "blue"? This is where the K-Means algorithm comes into play.
First, randomly select 32 color points as “cluster centers” from the colors, and label the remaining points according to the nearest cluster center. This gives us “star clusters” around the 32 color points. Then we move the cluster centers to the center of the “star clusters” and repeat the above steps until the cluster centers stop moving.
Done. We just managed to cluster into 32 stable clusters.
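The loop just described fits in a dozen lines. A toy 1-D version, with grayscale values standing in for colors and fixed starting centers so the demo is deterministic (real implementations start from random centers and work in full color space):

```python
# Grayscale pixel values standing in for colors; k = 3 clusters.
pixels = [10, 12, 11, 200, 205, 198, 90, 95, 92]
centers = [0.0, 128.0, 255.0]   # fixed starting centers for determinism

for _ in range(20):
    # Step 1: assign every point to its nearest cluster center.
    clusters = [[] for _ in centers]
    for p in pixels:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    # Step 2: move each center to the mean of its cluster.
    new_centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    if new_centers == centers:   # stop when the centers stop moving
        break
    centers = new_centers

print(sorted(round(c) for c in centers))   # [11, 92, 201]
```

The three final centers are exactly the "average colors" that would replace each group of pixels in the compressed image.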
Let’s show you a real-life example:

Finding cluster centers this way is convenient. However, real-world clusters are not always circular. Suppose you are a geologist and need to find some similar ores on a map. In this case, the shape of the clusters may be strange, even nested. You might not even know how many clusters there are—10? 100?
K-means algorithm may not work here, but DBSCAN algorithm will come in handy. We treat data points as people in a square and ask any three people who are close to hold hands. Next, we tell them to grab the hands of neighbors they can reach (the people cannot move during the process), and repeat this step until new neighbors join in. This gives us the first cluster, and we repeat the process until everyone is assigned to a cluster. Done. A bonus: a person holding no one’s hand—an anomaly.
The whole process looks cool.
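The "holding hands" procedure can be sketched as a neighbor flood-fill. This is a simplified, DBSCAN-flavored toy in 1-D (real DBSCAN additionally distinguishes core points from border points; the numbers are invented):

```python
# Invented 1-D points: two dense groups and one loner.
points = [1.0, 1.2, 1.1, 5.0, 5.1, 9.9]
eps, min_size = 0.5, 2   # "arm's reach" and the smallest allowed cluster

unvisited = set(range(len(points)))
clusters, noise = [], []
while unvisited:
    seed = unvisited.pop()
    group, frontier = {seed}, [seed]
    while frontier:   # keep grabbing neighbors of neighbors
        j = frontier.pop()
        for k in list(unvisited):
            if abs(points[k] - points[j]) <= eps:
                unvisited.remove(k)
                group.add(k)
                frontier.append(k)
    (clusters if len(group) >= min_size else noise).append(sorted(group))

print(sorted(clusters))   # [[0, 1, 2], [3, 4]]: the two dense groups
print(noise)              # [[5]]: the point at 9.9 holds nobody's hand
```

Note the bonus mentioned above falls out for free: anything that ends up in `noise` is an anomaly.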
If you are interested in further understanding clustering algorithms, you can read the article "5 Clustering Algorithms Every Data Scientist Should Know"[3].
Like classification algorithms, clustering can be used to detect anomalies. Abnormal operations after a user logs in? Let the machine temporarily disable their account and create a ticket for the technical support team to check what’s going on. Maybe the other party is a “bot.” We don’t even have to know what “normal behavior” looks like; we just need to feed the user’s behavior data to the model and let the machine decide whether the other party is a “typical” user.
This method may not be as effective as classification algorithms, but it’s still worth a try.
Dimensionality Reduction
“Assembling specific features into higher-level features”

“Dimensionality reduction” algorithms are currently used for:
- Recommendation systems
- Beautiful visualizations
- Topic modeling and finding similar documents
- Fake image recognition
- Risk management

Common algorithms include:

- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD)
- Latent Dirichlet Allocation (LDA)
- Latent Semantic Analysis (LSA, pLSA, GLSA)
- t-SNE (for visualization)
In the early years, “hardcore” data scientists used these methods, determined to discover “interesting things” in a pile of numbers. When Excel charts didn’t work, they forced the machine to do the pattern-finding work. Thus, they invented methods for dimensionality reduction or feature learning.
Projecting 2D data onto a line (PCA)
For people, abstract concepts are more convenient than a pile of fragmented features. For example, we can combine a dog with triangular ears, a long nose, and a big tail into the abstract concept of a “shepherd dog.” Compared to specific shepherd dogs, we indeed lose some information, but the new abstract concept is more useful in scenarios that require naming and explanation. As a bonus, these “abstract” models learn faster, use fewer features during training, and reduce overfitting.
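The projection captioned above can be computed by hand: center the points, build the 2x2 covariance matrix, and find its top eigenvector with power iteration. The points are invented and lie roughly along the line y = x, so that is the direction PCA should find:

```python
import math

# Invented 2-D points lying roughly along the line y = x.
pts = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9)]

# Center the cloud so the principal direction passes through the origin.
mx = sum(x for x, _ in pts) / len(pts)
my = sum(y for _, y in pts) / len(pts)
centered = [(x - mx, y - my) for x, y in pts]

# Entries of the 2x2 covariance-style matrix (the 1/n scaling is omitted
# because it does not change the eigenvector).
cxx = sum(x * x for x, _ in centered)
cyy = sum(y * y for _, y in centered)
cxy = sum(x * y for x, y in centered)

# Power iteration: repeatedly multiply and renormalize to converge on the
# top eigenvector, i.e. the principal component.
v = (1.0, 0.0)
for _ in range(50):
    v = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
    norm = math.hypot(*v)
    v = (v[0] / norm, v[1] / norm)

# 1-D coordinates of each point along the principal direction: 2-D -> 1-D.
projection = [x * v[0] + y * v[1] for x, y in centered]
print(v)  # a unit vector close to the y = x direction
```

Each point is now a single number instead of a pair, which is exactly the dimensionality reduction being described.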
These algorithms can shine in tasks of “topic modeling.” We can abstract their meanings from specific phrases. Latent Semantic Analysis (LSA) does this, based on the frequency of specific words you see on a certain topic. For example, technology-related vocabulary appears more frequently in tech articles, or politicians’ names mostly appear in political news, and so on.
We can create clusters directly from all the words in all articles, but doing so would lose all important connections (for example, the meanings of battery and accumulator are the same in different articles), and LSA can handle this well, which is why it is called “latent semantic.”
Therefore, we need to connect words and documents into a feature to maintain their latent relationships—people found that singular value decomposition (SVD) can solve this problem. Those useful topic clusters can be easily seen from grouped phrases.

Recommendation systems and collaborative filtering are another high-frequency area where dimensionality reduction algorithms are used. If you extract information from user ratings, you will get a great system to recommend movies, music, games, or anything you want.
Here’s a book I love: “Programming Collective Intelligence,” which was my bedside book during college.
It’s nearly impossible to fully understand this abstraction on machines, but you can pay attention to some correlations: some abstract concepts are related to user age—children play “Minecraft” or watch cartoons more often, while others may relate to movie styles or user preferences.
Based solely on information like user ratings, machines can identify these high-level concepts without even needing to understand them. Well done, Mr. Computer. Now we can write a paper on “Why Do Bearded Lumberjacks Like My Little Pony?”
Association Rule Learning
“Finding patterns in order streams”

“Association rules” are currently used for:
- Predicting sales and discounts
- Analyzing "items bought together"
- Planning product displays
- Analyzing web browsing patterns
Common algorithms include:
- Apriori
- Eclat
- FP-Growth
Algorithms used to analyze shopping carts, automate marketing strategies, and other event-related tasks are here. If you want to discover some patterns from a sequence of items, give them a try.
For instance, a customer takes a six-pack of beer to the checkout. Should we place peanuts on the way to the register? How often do people buy beer and peanuts together? Yes, association rules are likely applicable to the beer and peanuts situation, but can we predict what other sequences? Can we make small changes in product layout that lead to significant profit increases?
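The beer-and-peanuts question is, at its simplest, pair counting. A toy "support" computation over invented baskets (real algorithms like Apriori prune the search so it scales to millions of transactions):

```python
from collections import Counter
from itertools import combinations

# Invented transaction log; find the most frequently co-bought pair.
baskets = [
    {"beer", "peanuts", "chips"},
    {"beer", "peanuts"},
    {"bread", "milk"},
    {"beer", "peanuts", "diapers"},
    {"bread", "butter", "milk"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# "Support" of a pair: the fraction of baskets containing both items.
support = {pair: n / len(baskets) for pair, n in pair_counts.items()}
top = max(support, key=support.get)
print(top, support[top])   # ('beer', 'peanuts') 0.6
```

A store would then test whether moving the peanuts next to the beer actually lifts sales.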
This thinking also applies to e-commerce, where the tasks are even more interesting—What will the customer buy next?
I don't know why rule learning seems so rarely mentioned within machine learning. The classical approach works by a head-on search through all purchased items, using trees or sets. Algorithms can only search for patterns; they cannot generalize them or reproduce them on new examples.
In the real world, every large retailer has established their own exclusive solutions, so this won’t bring you a revolution. The highest-level technology mentioned in this article is recommendation systems. However, I might not be aware of any breakthroughs in this area. If you have anything to share, please let me know in the comments.
This article is reproduced from the public account Datawhale.
Translator: Ahong. Source: dataxon.