Machine learning is a hot topic, but apart from the people who teach it, very few can explain what it's actually about. If you read articles about machine learning online, you're likely to run into one of two things: either a heavy academic tome stuffed with theorems (I struggle to get through even half of one) or dazzling stories about artificial intelligence, data-science magic, and the future of work.
I decided to write a long-awaited article to give a simple introduction to those who want to understand machine learning. It won’t involve advanced principles; instead, I’ll use simple language to discuss real-world problems and practical solutions. Whether you’re a programmer or a manager, you should be able to understand it. So let’s get started!
Why Do We Want Machines to Learn?
Enter Billy, who wants to buy a car and needs to calculate how much he needs to save each month to afford it. After browsing dozens of ads online, he learns that a new car costs around $20,000, a one-year-old used car costs $19,000, and a two-year-old car costs $18,000, and so on.
As a clever analyst, Billy discovers a pattern: the price of a car depends on its age, decreasing by $1,000 for each year, but it won’t go below $10,000.
In machine learning terms, Billy has invented “regression” — predicting a value (price) based on known historical data. When people try to estimate a reasonable price for a used iPhone on eBay or calculate how many ribs to prepare for a barbecue party, they are using a method similar to Billy’s — 200g per person? 500?
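If you'd like to see Billy's rule as code, here's a minimal sketch in Python (the ad numbers are the made-up ones from the example above; a real regression would learn the $1,000-per-year slope and the $10,000 floor from the data instead of hard-coding them):

```python
# Billy's hand-made "regression": price depends on age.
ads = [(0, 20000), (1, 19000), (2, 18000)]  # (age in years, price in $), from the ads above

def predict_price(age):
    # Lose $1,000 per year of age, but never drop below $10,000.
    return max(20000 - 1000 * age, 10000)

for age, actual in ads:
    print(f"{age}-year-old car: ad says ${actual}, Billy predicts ${predict_price(age)}")
```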
Yes, it would be great to have a simple formula to solve all the world’s problems — especially for barbecue parties — but unfortunately, that’s not possible.
Let’s return to the car-buying scenario. Now the problem is, besides the car’s age, there are different production dates, dozens of accessories, technical conditions, seasonal demand fluctuations… who knows what other hidden factors there are… Ordinary people like Billy can’t consider all this data when calculating prices; I wouldn’t be able to either.
People are both lazy and not that bright — we need robots to do the math for them. Therefore, we take a computational approach: give the machine the data and ask it to find all the potential patterns related to price.
Finally, it works! The most exciting part is that the machine handles this much better than a real person carefully analyzing all the dependencies in their head.
This is how machine learning was born.
Three Components of Machine Learning
Setting aside all the nonsense related to artificial intelligence (AI), the sole goal of machine learning is to predict outcomes based on input data, plain and simple. All machine learning tasks can be represented this way; otherwise, it isn’t a machine learning problem from the start.
The more diverse the samples, the easier it is to find relevant patterns and predict outcomes. Therefore, we need three components to train the machine:
(1) Data
Want to detect spam? Get samples of spam messages. Want to predict stock prices? Find historical price data. Want to identify user preferences? Analyze their activity on Facebook (no, Mark, stop collecting data — it's enough). The more diverse the data, the better the results. A machine needs at least hundreds of thousands of rows of data to have a fighting chance.
There are two main ways to obtain data — manually or automatically. Manually collected data contains fewer errors but takes more time to gather, which usually makes it more expensive. Automated methods are relatively cheap: you collect everything you can find and hope the quality is good.
Smart companies like Google leverage their users to label data for free. Remember reCAPTCHA forcing you to "select all the road signs"? That's how they gather data — free labor! Well done. If I were them, I'd show those verification images even more often, but wait…
Good datasets are really hard to obtain, and they are so valuable that companies may open-source their algorithms but almost never disclose their datasets.
(2) Features
Also known as “parameters” or “variables,” such as the mileage of a car, user gender, stock price, word frequency in a document, etc. In other words, these are the factors the machine needs to consider. If the data is stored in a tabular format, features correspond to column names, which is relatively simple. But what if it’s a 100GB collection of cat images? We can’t treat every pixel as a feature. That’s why selecting appropriate features often takes more time than other steps in machine learning, and feature selection is also a major source of error. Subjective tendencies in humans lead them to choose features they like or feel are “more important” — this should be avoided.
(3) Algorithms
The most obvious part. Any problem can be solved in different ways. The method you choose will affect the final model’s accuracy, performance, and size. One thing to note is that if the data quality is poor, even the best algorithm won’t help. This is known as “garbage in, garbage out” (GIGO). Therefore, before putting a lot of thought into accuracy, more data should be obtained.
Learning vs. Intelligence
I once saw an article titled "Will Neural Networks Replace Machine Learning?" on a popular media site. These journalists somehow manage to hype technologies as modest as linear regression into "artificial intelligence," stopping just short of calling it "Skynet." Here is how these easily confused concepts actually relate:
- "Artificial Intelligence" is the name of the entire discipline, similar to "biology" or "chemistry."
- "Machine Learning" is an essential part of "artificial intelligence," but not the only one.
- "Neural Networks" are one branch of machine learning methods. They are quite popular, but there are other branches under the machine learning umbrella.
- "Deep Learning" is a modern approach to building, training, and using neural networks. Essentially, it's a new architecture. In current practice, nobody separates "deep learning" from "ordinary networks"; the libraries you call are the same. To avoid looking foolish, you'd better name the exact type of network instead of throwing around buzzwords.
The general principle is to compare things at the same level. That’s why “neural networks will replace machine learning” sounds like “wheels will replace cars.” Dear media, this will seriously damage your reputation.
The Landscape of Machine Learning
If you're too lazy to read long paragraphs, the overview below should help you get your bearings.
In the world of machine learning, there is never just one way to solve a problem — it's important to remember this — because you'll always find several algorithms that can solve a given problem, and you have to choose the one that fits best. Of course, you can throw a "neural network" at everything, but who's going to pay for all the hardware with the computing power to run it?
Let’s start with some basic overviews. Currently, machine learning mainly has four directions.
Part 1: Classic Machine Learning Algorithms
Classic machine learning algorithms originate from pure statistics in the 1950s. Statisticians addressed formal math problems like finding patterns in numbers, estimating distances between data points, and calculating vector directions.
Today, half of the internet runs on these algorithms. When you see a list of articles to "read next," or your bank card gets blocked at a gas station in the middle of nowhere, it's most likely the work of one of these little guys.
Large tech companies are staunch advocates of neural networks. The reason is clear: for these large enterprises, a 2% accuracy improvement means an additional $2 billion in revenue. However, when a company’s business volume is small, it becomes less critical. I’ve heard of teams spending a year developing new recommendation algorithms for their e-commerce sites, only to find out later that 99% of their traffic comes from search engines — their algorithm was useless since most users wouldn’t even open the homepage.
Classic algorithms are widely used, and their principles are so simple you could explain them to a toddler. They are like basic arithmetic — we use them every day without even thinking.
1.1 Supervised Learning
Classic machine learning is generally divided into two categories: supervised learning and unsupervised learning.
In “supervised learning,” there is a “supervisor” or “teacher” who provides the machine with all the answers to aid learning, such as whether the image is of a cat or a dog. The “teacher” has already labeled the dataset — marking it as “cat” or “dog,” and the machine uses these example data to learn to distinguish cats from dogs.
Unsupervised learning means that the machine has to independently distinguish who is who among a pile of animal images. The data is not pre-labeled, and there is no “teacher”; the machine has to find all possible patterns by itself. This will be discussed later.
Clearly, the presence of a “teacher” allows the machine to learn faster, so supervised learning is more commonly used in real life.
Supervised learning is divided into two categories:
- classification, predicting the category an object belongs to;
- regression, predicting a specific point on a numeric axis.
“Classifying objects based on a known attribute, such as sorting socks by color, classifying documents by language, or categorizing music by style.”
Classification algorithms are commonly used for:
- spam filtering;
- finding similar documents;
- recognizing handwritten letters or numbers;
- fraud detection.
Common algorithms include Naive Bayes, decision trees, logistic regression, and support vector machines (SVM), most of which are covered below.
Classification is machine learning's bread-and-butter task. The machine here is like a toddler learning to sort toys: this is a "robot," this is a "car," this is a "machine-car"… wait, error! Error!
In classification tasks, you need a “teacher.” The data needs to be pre-labeled so that the machine can learn to classify based on these labels. Everything can be classified — users can be categorized by interests, articles can be categorized by language and topic (which is crucial for search engines), and music can be categorized by type (Spotify playlists), and your emails are no exception.
The Naive Bayes algorithm is widely used for spam filtering. The machine counts the frequency of terms like "Viagra" in spam and in normal emails, then uses Bayes' theorem to combine the per-word probabilities — ha, the machine has learned.
Later, spammers learned to counter Bayesian filters by padding their emails with lots of "good" words at the end — a method ironically dubbed "Bayesian poisoning." Naive Bayes went down in history as the most elegant and first practically useful spam filter, but other algorithms handle spam filtering now.
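For the curious, here's a minimal sketch of such a filter in Python using scikit-learn's MultinomialNB (the five toy emails are invented for illustration; a real filter trains on hundreds of thousands of messages):

```python
# A toy Naive Bayes spam filter: count word frequencies, apply Bayes' theorem.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["cheap viagra now", "meeting at noon", "viagra discount buy",
          "lunch with the team", "buy cheap pills"]
labels = [1, 0, 1, 0, 1]  # 1 = spam, 0 = normal (the "teacher's" answers)

vectorizer = CountVectorizer()        # turns each email into word counts
X = vectorizer.fit_transform(emails)  # the word-frequency features

model = MultinomialNB().fit(X, labels)
print(model.predict(vectorizer.transform(["cheap viagra discount"])))  # -> [1], spam
```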
Another example of a classification algorithm.
If you need to borrow money now, how does the bank know if you’ll pay it back in the future? They can’t be sure. However, the bank has many historical borrower files containing data such as “age,” “education level,” “occupation,” “salary,” and — most importantly — “whether they paid back.”
Using this data, we can train the machine to find patterns and derive answers. Finding answers is not the problem; the issue is that banks cannot blindly trust the answers given by the machine. What if the system fails, gets hacked, or a drunk graduate just patched the system? What should they do?
To address this issue, we need to use decision trees, where all data is automatically divided into “yes/no” questions — for example, “Does the borrower’s income exceed $128.12?” — it sounds a bit inhumane. However, the machine generates such questions to optimally partition the data at each step.
This is how the "tree" is formed. The higher the branch (the closer to the root), the broader the question. Analysts like this method because they can explain the machine's results afterward, even without fully understanding how the algorithm works (typical analysts!).
Decision trees are widely used in high-stakes scenarios: diagnostics, medicine, and finance.
The two most well-known decision tree algorithms are CART and C4.5.
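Here's a minimal sketch in Python; scikit-learn's DecisionTreeClassifier is an optimized CART implementation, and the toy loan data below is invented for illustration:

```python
# A tiny loan-repayment decision tree built from invented historical files.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age, income in $]; target: 1 = paid back, 0 = defaulted.
X = [[25, 30000], [40, 90000], [35, 50000], [22, 20000], [50, 120000], [30, 40000]]
y = [0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
# Print the machine-generated "yes/no" questions, e.g. "income <= 45000.00?"
print(export_text(tree, feature_names=["age", "income"]))
```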
Today, pure decision trees are rarely used on their own. However, they are the building blocks of larger systems, and ensembles of decision trees can even outperform neural networks. We'll get to that later.
When you search on Google, it’s a bunch of clumsy “trees” helping you find answers. Search engines love these algorithms because they run fast.
In theory, Support Vector Machines (SVM) should be the most popular classification method. Anything that exists can be classified using it: classifying plants in images by shape, classifying documents by category, etc.
The idea behind SVM is simple — it tries to draw two parallel lines between the data points of different classes and push those lines as far apart as possible; the wider the gap, the more reliable the classifier.
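A minimal sketch of that idea in Python (the two toy "blobs" of points are invented; scikit-learn's SVC with a linear kernel finds the maximum-margin line between them):

```python
# A linear SVM: find the widest "street" separating two classes.
from sklearn.svm import SVC

X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 8], [8, 6]]  # two invented blobs of points
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear").fit(X, y)
# The support vectors are the points that sit right on the margin lines.
print(clf.support_vectors_)
print(clf.predict([[3, 2], [7, 7]]))  # -> [0 1]
```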
Classification algorithms have one very useful side application — anomaly detection. Whenever a feature doesn't fit any known category, we flag it. This method is already used in medicine: in MRI (magnetic resonance imaging) scans, the computer marks all suspicious areas or deviations. Stock markets use it to detect abnormal trader behavior and find insiders. By teaching the computer what is correct, we automatically teach it to recognize what is wrong.
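Here's a sketch of one way to do this in Python, using scikit-learn's OneClassSVM; the "trades" are invented [price, volume] pairs, and anything unlike the training data gets flagged as -1:

```python
# Anomaly detection sketch: train on "normal" trades, flag the deviations.
from sklearn.svm import OneClassSVM

normal_trades = [[100, 1], [102, 2], [98, 1], [101, 3], [99, 2]]  # [price, volume]
detector = OneClassSVM(nu=0.1, kernel="rbf", gamma="scale").fit(normal_trades)

# -1 means "doesn't fit any known pattern" — flag it for review.
print(detector.predict([[100, 2], [250, 90]]))  # expected: [ 1 -1]
```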
The rule of thumb: the more complex the data, the more complex the algorithm. For text, numbers, and tables, I would choose classic methods. Those models are smaller, learn faster, and have clearer workflows. For images, video, and other complex big data, I would definitely look at neural networks.
Just five years ago, you could still find face classifiers based on SVM. Now, it’s easier to pick one model from hundreds of pre-trained neural network models. However, spam filters haven’t changed; they are still written using SVM, and there’s no reason to change it. Even my website uses SVM to filter spam in comments.
Regression
“Draw a line through these points, hmm~ this is machine learning”
Regression algorithms are currently used for:
- stock price prediction;
- supply and sales volume analysis;
- calculating time series correlations.
Common regression algorithms include linear regression and polynomial regression.
The “regression” algorithm is essentially a “classification” algorithm; it just predicts a value rather than a category. For example, predicting the price of a car based on mileage, estimating traffic volume at different times of the day, and predicting the extent of supply changes as a company grows, etc. Regression algorithms are the best choice for dealing with time-related tasks.
Regression is beloved by people in finance and analytics; it has even become a built-in feature in Excel, and the whole process runs very smoothly — the machine simply tries to draw the line that represents the average correlation. Unlike a person with a pen and a whiteboard, though, the machine does it with mathematical precision, minimizing the average distance from every point to the line.
If the line is straight, it's "linear regression"; if it's curved, it's "polynomial regression." These are the two main types of regression; the others are fairly exotic. Don't be fooled by logistic regression, the troublemaker of the bunch: despite the name, it's a classification algorithm, not regression.
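A minimal sketch of both in Python (the mileage/price numbers are invented): fit a straight line, then add squared features to fit a curve:

```python
# Linear vs. polynomial regression on invented car-price data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

mileage = np.array([[10], [30], [50], [80], [120]])    # thousands of km (invented)
price = np.array([19000, 16500, 14000, 11500, 10500])  # $ (invented)

line = LinearRegression().fit(mileage, price)          # linear regression: a straight line

poly = PolynomialFeatures(degree=2)                    # add mileage^2 as an extra feature
curve = LinearRegression().fit(poly.fit_transform(mileage), price)  # polynomial regression

print(line.predict([[60]]))                            # prediction from the straight line
print(curve.predict(poly.transform([[60]])))           # prediction from the curve
```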
That said, it's fine to mix up "regression" and "classification." Some classifiers turn into regressors after a bit of parameter tuning: beyond naming an object's category, we can also ask how close the object is to that category, and that brings us to a regression problem.
If you want to delve deeper, you can read the article “Machine Learning for Humans” [1] (highly recommended).
1.2 Unsupervised Learning
Unsupervised learning appeared slightly later than supervised learning — in the 1990s, this type of algorithm was used relatively less, sometimes just because there were no other options.
Having labeled data is a luxury. Suppose I want to create a — let’s say “bus classifier”; does that mean I have to personally take millions of photos of those damned buses on the street and label each one? No way, that would take me a lifetime, and I still have many games on Steam to play.
In this case, capitalism offers some hope: thanks to crowdsourcing, we have access to millions of cheap workers. For example, Mechanical Turk [2] is a crowd of people ready to complete your task for $0.05. That's usually how things get done.
Or, you can try using unsupervised learning. However, I don’t recall any best practices regarding it. Unsupervised learning is typically used for exploratory data analysis rather than as a primary algorithm. Those with degrees from Oxford and special training feed the machine a bunch of garbage and then start observing: are there any clusters? No. Can we see some connections? No. Well, you still want to work in data science, right?
Clustering
“The machine will choose the best way to distinguish things based on some unknown features.”
Clustering algorithms are currently used for:
- market segmentation (customer types, loyalty);
- merging nearby points on a map;
- image compression;
- analyzing and labeling new data;
- detecting abnormal behavior.
Common algorithms include K-Means and DBSCAN, both discussed below.
Clustering is classification without pre-labeled categories. It's like sorting socks by color when you don't remember all the colors you own. The clustering algorithm looks for similar objects (by certain features) and groups them into clusters; objects sharing many features end up in the same cluster. Some algorithms even let you set the exact number of clusters you want.
Here’s a good example of clustering — markers on online maps. When you search for nearby vegetarian restaurants, the clustering engine groups them and displays them with numbered bubbles. If it didn’t, the browser would freeze — trying to plot all 300 vegetarian restaurants in this trendy city on the map.
Apple Photos and Google Photos use more complex clustering methods. They create albums of your friends by searching for faces in photos. The application doesn’t know how many friends you have or what they look like but can still find common facial features. This is typical clustering.
Another common application is image compression. When saving an image as PNG, you can restrict the palette to, say, 32 colors. The clustering algorithm then has to find all the "reddish" pixels, compute the "average red," and assign it to all of them. Fewer colors mean a smaller file — a win!
However, it gets tricky with colors like blue and green. Is it green or blue? This is when the K-Means algorithm comes in.
First, randomly select 32 color points as “cluster centers” from the colors, and label the remaining points according to the nearest cluster center. This gives us “star clusters” around the 32 color points. Then we move the cluster centers to the center of the “star clusters” and repeat the process until the cluster centers stop moving.
Done. We just clustered into 32 stable clusters.
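Here's a from-scratch sketch of those exact steps in Python, using 2-D points for simplicity (real color quantization would use 3-D RGB values):

```python
# K-Means from scratch, mirroring the steps described above.
import random

def kmeans(points, k, iters=100):
    centers = random.sample(points, k)          # step 1: pick k random points as centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # step 2: label each point with the
            nearest = min(range(k),             #         nearest cluster center
                          key=lambda i: (p[0] - centers[i][0]) ** 2
                                      + (p[1] - centers[i][1]) ** 2)
            clusters[nearest].append(p)
        new_centers = [                          # step 3: move each center to the
            (sum(p[0] for p in c) / len(c),      #         middle of its cluster
             sum(p[1] for p in c) / len(c)) if c else centers[i]
            for i, c in enumerate(clusters)]
        if new_centers == centers:               # step 4: stop when the centers settle
            return centers, clusters
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)], k=2)
print(centers)  # two stable cluster centers, roughly (1.3, 1.3) and (8.3, 8.3)
```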
Finding cluster centers this way is convenient, but real-world clusters are not always circular. Suppose you are a geologist needing to find similar ores on a map. In this case, the shape of the clusters can be strange, even nested. You might not even know how many clusters there are, 10? 100?
The K-Means algorithm won't work here, but DBSCAN will. Think of the data points as people standing in a square. Ask any three people standing close together to hold hands. Then tell them to grab the hands of any neighbors they can reach (without anyone moving), and repeat until no new neighbors can join. That's the first cluster. Repeat the process until everyone is assigned to a cluster — done.
An unexpected bonus: a person left with no one's hand to hold — an anomalous data point.
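A minimal sketch in Python using scikit-learn's DBSCAN (toy points invented for illustration); note the lone point labeled -1, our anomaly:

```python
# DBSCAN sketch: clusters of "hand-holding" neighbors, strays marked as -1.
from sklearn.cluster import DBSCAN

points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8], [50, 50]]
labels = DBSCAN(eps=2, min_samples=2).fit_predict(points)

print(labels)  # -> [0 0 0 1 1 1 -1]: the lone point at (50, 50) is an anomaly
```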
The whole process looks cool.
Interested in learning more about clustering algorithms? You can read the article “5 Clustering Algorithms Every Data Scientist Should Know” [3].
Like classification algorithms, clustering can be used for anomaly detection. Is there abnormal behavior after a user logs in? The machine temporarily disables their account and creates a ticket for technical support to check what’s going on. Maybe they are a “robot.” We don’t even need to know what “normal behavior” looks like; just feed the user behavior data to the model and let the machine decide if the other party is a “typical” user. This method, although not as effective as classification algorithms, is still worth trying.
Dimensionality Reduction
“Assembling specific features into higher-level features”
"Dimensionality reduction" algorithms are currently used for:
- recommendation systems;
- beautiful visualizations;
- topic modeling and finding similar documents.
Common "dimensionality reduction" algorithms include:
- Principal Component Analysis (PCA);
- Singular Value Decomposition (SVD);
- Latent Dirichlet Allocation (LDA);
- Latent Semantic Analysis (LSA, pLSA, GLSA);
- t-SNE (for visualization).
Years ago, “hardcore” data scientists would use these methods, determined to find “interesting things” in a pile of numbers. When Excel charts didn’t work, they forced the machine to do pattern finding. Thus, they invented dimensionality reduction or feature learning methods.
Projecting 2D data onto a line (PCA)
For people, abstract concepts are more convenient than a pile of fragmented features. For example, we can combine dogs with triangular ears, long noses, and big tails into the abstract concept of “sheepdog.” Compared to specific sheepdogs, we lose some information, but the new abstract concept is more useful in scenarios where naming and explaining are required. As a bonus, these “abstract” models learn faster, require fewer features during training, and reduce overfitting.
These algorithms shine in tasks like "topic modeling," where we abstract meaning out of specific phrases. This is what Latent Semantic Analysis (LSA) does, based on how frequently specific words appear in documents on a topic: technical terms show up more often in technology articles, politicians' names mostly appear in political news, and so on.
We could create clusters from all words in all articles, but doing so would lose all important connections (for example, the meanings of battery and accumulator are the same in different articles); LSA handles this problem well, which is why it’s called “latent semantics.”
Thus, we need to connect words and documents into a single feature that preserves these latent relationships — it turns out Singular Value Decomposition (SVD) does exactly that, and useful topic clusters become easy to see from the grouped phrases.
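Here's a minimal LSA sketch in Python (four invented mini-documents): turn texts into word-frequency features, then let truncated SVD squeeze them into a couple of latent "topics":

```python
# LSA sketch: TF-IDF word frequencies + SVD squeeze documents into "topics".
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["battery life of this phone", "the accumulator drains fast",
        "election results and politicians", "politicians debate the election"]

X = TfidfVectorizer().fit_transform(docs)                # word-frequency features
topics = TruncatedSVD(n_components=2).fit_transform(X)   # 2 latent "topics"
print(topics.round(2))  # each row: how strongly a document belongs to each topic
```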

Recommendation systems and collaborative filtering are another field where dimensionality reduction algorithms are frequently used. If you extract information from user ratings, you’ll get a great system to recommend movies, music, games, or anything else you want.
Here’s a book I love, “Programming Collective Intelligence,” which was my bedside book during college.
It's almost impossible for a human to interpret these machine-made abstractions directly, but you can spot correlations: some abstract concepts correlate with user age (children play "Minecraft" or watch cartoons more), others with movie genres or user preferences.
Based solely on information like user ratings, the machine can identify these high-level concepts without needing to understand them. Well done, Mr. Computer. Now we can write a paper on “Why Bearded Lumberjacks Love My Little Pony.”
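As a sketch of how the machine pulls those high-level concepts out of nothing but ratings, here's plain SVD on an invented user-by-movie matrix; keeping only two "taste" components fills in plausible scores for unseen movies:

```python
# Collaborative-filtering sketch: factor a user-x-movie ratings matrix with SVD.
import numpy as np

# Rows = users, columns = movies; 0 means "not rated yet" (invented data).
ratings = np.array([[5, 4, 0, 1],
                    [4, 5, 1, 0],
                    [1, 0, 5, 4],
                    [0, 1, 4, 5]], dtype=float)

U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
k = 2                                         # keep 2 abstract "taste" concepts
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(approx.round(1))  # the reconstructed scores suggest the missing ratings
```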
Association Rule Learning
“Finding patterns in order flows”
"Association rules" are currently used for:
- predicting sales and discounts;
- analyzing products purchased together;
- planning product displays;
- analyzing web browsing patterns.
Common algorithms include Apriori, Eclat, and FP-Growth.
Algorithms used to analyze shopping carts, automate marketing strategies, and other event-related tasks are here. If you want to discover patterns from a sequence of items, try them out.
For example, a customer takes a six-pack of beer to the checkout. Should we place peanuts along the way? How often do people buy beer and peanuts together? Yes, association rules might apply to the beer and peanuts scenario, but what other sequences can we predict? Can we make minor changes to product layout that lead to significant profit increases?
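Here's the beer-and-peanuts arithmetic as a Python sketch (the five shopping baskets are invented): "support" says how often the pair appears at all, and "confidence" says how often beer buyers also grab peanuts:

```python
# Association-rule sketch: support and confidence for the rule "beer -> peanuts".
baskets = [{"beer", "peanuts"}, {"beer", "peanuts", "chips"},
           {"beer", "diapers"}, {"milk", "bread"}, {"beer", "peanuts"}]

beer = sum(1 for b in baskets if "beer" in b)
both = sum(1 for b in baskets if {"beer", "peanuts"} <= b)

print("support:", both / len(baskets))   # 3/5: the pair shows up in 60% of carts
print("confidence:", both / beer)        # 3/4: 75% of beer buyers also took peanuts
```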
This thinking also applies to e-commerce, where the tasks become more interesting — what will the customer buy next?
For some reason, rule learning is seldom discussed within machine learning. The classic approach is a head-on search through all purchased goods using trees or sets. Such algorithms can only search for existing patterns; they cannot generalize or reproduce them on new examples.
In the real world, every major retailer has built its own proprietary solution, so no revolutions here for you. The highest level of tech mentioned here is recommendation systems. But maybe I'm simply unaware of a breakthrough in this area. If you have anything to share, please let me know in the comments.
Original article: https://valyrics.vas3k.com/blog/machine_learning
Source: Computer Education