An Introduction to Machine Learning in Simple Terms

Machine learning is a topic everyone is discussing, but aside from those who understand it deeply, very few can explain what it is clearly. If you read articles about machine learning online, you are likely to encounter one of two things: dense academic texts filled with theorems (I can barely handle half a theorem) or grand tales about artificial intelligence, data-science magic, and the future of work.


I decided to write a long-awaited article that provides a simple introduction for those who want to understand machine learning. We will not discuss advanced principles but will use simple language to talk about real-world problems and practical solutions. Whether you are a programmer or a manager, you will be able to understand.

So let’s get started!

Why Do We Want Machines to Learn?

Enter Billy, who wants to buy a car and needs to figure out how much to save each month. After browsing dozens of ads online, he learns that new cars cost around $20,000, while a one-year-old used car costs $19,000, and a two-year-old car costs $18,000, and so on.

As a clever analyst, Billy discovers a pattern: the price of a car depends on its age, dropping by $1,000 for each additional year, but it will not go below $10,000.

In machine learning terms, Billy has invented “regression” – predicting a value (price) based on known historical data. When people try to estimate a reasonable price for a second-hand iPhone on eBay or calculate how many ribs to prepare for a barbecue, they are using a method similar to Billy’s – is it 200g per person? 500?
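Billy’s pricing rule is already a tiny regression model. As a sketch (using the numbers from his ads above), it fits in a few lines of Python:

```python
def estimate_price(age_years, base=20_000, drop_per_year=1_000, floor=10_000):
    """Billy's rule: the price falls by $1,000 per year of age,
    but never drops below $10,000."""
    return max(base - drop_per_year * age_years, floor)

print(estimate_price(0))   # new car: 20000
print(estimate_price(2))   # two years old: 18000
print(estimate_price(15))  # old clunker: hits the 10000 floor
```

A real regression would learn `base` and `drop_per_year` from the data instead of hard-coding them, which is exactly what the machine does in the next section.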

Yes, if there were a simple formula to solve all the world’s problems, especially for barbecue parties, that would be great – unfortunately, it is impossible.

Let’s return to the car-buying scenario. The problem is that, besides age, cars differ in production date, have dozens of parts and varying technical condition, face seasonal demand fluctuations… and who knows what other hidden factors. Ordinary Billy cannot hold all this data in his head when calculating the price, and neither can I.

People are lazy and dumb – we need robots to do the math for us. So let’s take the computational route: give the machine some data and let it find all the potential patterns related to price.

Finally, it works! The most exciting part is that compared to a human carefully analyzing all the dependency factors in their mind, machines handle it much better.

Thus, machine learning was born.

The Three Components of Machine Learning


Leaving aside all the nonsense related to artificial intelligence (AI), the sole goal of machine learning is to predict outcomes based on input data, that’s it. All machine learning tasks can be represented this way; otherwise, it wouldn’t be a machine learning problem from the start.

The more diverse the samples, the easier it is to find correlated patterns and predict outcomes. Therefore, we need three components to train the machine:

Data

Want to detect spam? Get samples of spam messages. Want to predict stocks? Find historical price information. Want to identify user preferences? Analyze their activity records on Facebook (no, Mark, stop collecting data, that’s enough). The more diverse the data, the better the outcome. For a machine to function properly, you need at least hundreds of thousands of rows of data.

There are two main ways to acquire data – manually or automatically. Manually collected data has fewer mixed errors but takes more time – and usually costs more. Automated methods are relatively cheap; you can collect all the data you can find (as long as the data quality is good).

Smart companies like Google use their users to label data for free. Remember reCAPTCHA forcing you to “select all the road signs”? That’s exactly how they acquire labeled data – free labor! Well done. If I were them, I would display those verification images more often, but wait…


Good datasets are really hard to obtain; they are so important that companies may even open-source their algorithms but rarely disclose their datasets.

Features

Also known as “parameters” or “variables,” such as the mileage of a car, user gender, stock prices, or word frequency in a document. In other words, these are the factors the machine needs to consider.

If the data is stored in a tabular format, features correspond to column names, which is relatively simple. But what about a 100GB collection of cat images? We can’t treat every pixel as a feature. This is why selecting appropriate features often takes more time than other steps in machine learning, and feature selection is also a major source of errors. Human subjectivity leads people to select features they like or feel are “more important” – this should be avoided.

Algorithms

The most obvious part. Any problem can be solved in different ways. The method you choose will affect the final model’s accuracy, performance, and size. One thing to note: if the data quality is poor, even the best algorithm won’t help. This is known as “garbage in, garbage out” (GIGO). Therefore, before spending a lot of effort on accuracy, more data should be obtained.

Learning vs. Intelligence

I once saw an article titled “Will Neural Networks Replace Machine Learning?” on some popular media sites. These journalists always inexplicably exaggerate technologies like linear regression as “artificial intelligence,” almost calling it “Skynet.” The following diagram shows the relationship between several easily confused concepts.

  • “Artificial Intelligence” is the name of the entire discipline, similar to “Biology” or “Chemistry.”
  • “Machine Learning” is an important part of “Artificial Intelligence,” but not the only part.
  • “Neural Networks” are one branch of machine learning methods; this branch is quite popular, but there are other branches under the machine learning umbrella.
  • “Deep Learning” is a modern method for building, training, and using neural networks. Essentially, it is a new architecture. In current practice, no one distinguishes deep learning from “ordinary” neural networks, as the libraries needed to use them are the same. To avoid looking like a fool, it’s best to name the specific type of network and avoid buzzwords.

The general principle is to compare things at the same level. That’s why “neural networks will replace machine learning” sounds like “wheels will replace cars.” Dear media, this will hurt your reputation significantly.

What machines can do:

  • Predict;
  • Remember;
  • Replicate;
  • Select optimal options;

What machines cannot do:

  • Create new things;
  • Quickly become smart;
  • Go beyond their task scope;
  • Exterminate all humanity;

The Landscape of Machine Learning

If you’re too lazy to read long texts, the overview below will help you get oriented.


In the world of machine learning, there is never just one way to solve a problem – it’s important to remember this – because you will always find several algorithms that can be used to solve a particular problem, and you need to choose the most suitable one. Of course, all problems can be handled using “neural networks,” but who will bear the hardware costs behind the computational power?

Let’s start with some basic overviews. Currently, machine learning mainly has four directions.


Part 1: Classic Machine Learning Algorithms

Classic machine learning algorithms originated from pure statistics in the 1950s. Statisticians solved formal math problems such as finding patterns in numbers, estimating distances between data points, and calculating vector directions.

Today, half of the internet is studying these algorithms. When you see a column of “continue reading” articles or find your bank card locked at some remote gas station, it’s likely one of those little guys at work.

Large tech companies are staunch advocates of neural networks. The reason is obvious; for these large enterprises, a 2% increase in accuracy means an additional $2 billion in revenue. However, when a company is small, it’s not that important. I heard of a team that spent a year developing a new recommendation algorithm for their e-commerce site, only to find out later that 99% of their site traffic came from search engines – their algorithm was useless since most users wouldn’t even open the homepage.

Although classic algorithms are widely used, their principles are quite simple, and you can easily explain them to a toddler. They are like basic arithmetic – we use them every day without even thinking.

1.1 Supervised Learning

Classic machine learning is typically divided into two categories: Supervised Learning and Unsupervised Learning.

In “Supervised Learning,” there is a “supervisor” or “teacher” who provides the machine with all the answers to assist in learning, such as whether an image is of a cat or a dog. The “teacher” has already completed the dataset partitioning – labeling “cat” or “dog,” and the machine uses these example data to learn to distinguish between cats and dogs.

Unsupervised learning means the machine has to independently distinguish who is who among a bunch of animal pictures. The data is not pre-labeled, and there is no “teacher,” so the machine has to find all possible patterns on its own. This will be discussed later.

Clearly, when there is a “teacher” present, the machine learns faster, which is why supervised learning is more commonly used in real life. Supervised learning is divided into two categories:

  • Classification, predicting the category to which an object belongs;
  • Regression, predicting a specific point on a numerical axis;

Classification

“Classifying objects based on a known attribute, such as categorizing socks by color, classifying documents by language, or dividing music by style.”


Classification algorithms are commonly used for:

  • Filtering spam;
  • Language detection;
  • Finding similar documents;
  • Sentiment analysis;
  • Recognizing handwritten letters or numbers;
  • Fraud detection;

Common algorithms include:

  • Naive Bayes
  • Decision Trees
  • Logistic Regression
  • K-Nearest Neighbours
  • Support Vector Machine

Machine learning mainly addresses “classification” problems. This machine is like a toddler learning to classify toys: this is a “robot,” this is a “car,” this is a “machine-car”… um, wait, wrong! Wrong!

In classification tasks, you need a “teacher.” The data needs to be pre-labeled so that the machine can learn to classify based on these labels. Everything can be classified – categorizing users by interests, categorizing articles by language and topic (which is important for search engines), categorizing music by genre (Spotify playlists), and your emails are no exception.

The Naive Bayes algorithm is widely used for spam filtering. The machine counts how often words like “Viagra” appear in spam versus normal emails, then applies Bayes’ theorem: it multiplies the per-word probabilities for each class and picks the class with the higher overall probability – ha, the machine has learned.
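A minimal sketch of the idea (the toy messages are invented for illustration; Laplace smoothing is added so unseen words don’t zero out an entire class, and log-probabilities replace raw products to avoid underflow):

```python
from collections import Counter
import math

# Toy corpus: real filters train on thousands of messages.
spam = ["viagra cheap pills", "cheap viagra now"]
ham = ["meeting schedule today", "lunch with the team today"]

def word_counts(docs):
    return Counter(w for d in docs for w in d.split())

spam_counts = word_counts(spam)
ham_counts = word_counts(ham)
vocab = set(spam_counts) | set(ham_counts)

def log_prob(msg, counts, prior=0.5):
    # Summing logs = multiplying probabilities; the +1 is Laplace smoothing.
    total = sum(counts.values())
    lp = math.log(prior)
    for w in msg.split():
        lp += math.log((counts[w] + 1) / (total + len(vocab)))
    return lp

def classify(msg):
    return "spam" if log_prob(msg, spam_counts) > log_prob(msg, ham_counts) else "ham"

print(classify("cheap viagra"))    # spam
print(classify("schedule lunch"))  # ham
```

The “naive” part is the assumption that words occur independently of each other – false in practice, but good enough that this filter worked for years.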


Later, spammers learned how to counteract Bayesian filters by adding lots of “good” words at the end of the email – a method ironically dubbed “Bayesian poisoning.” Naive Bayes goes down in history as one of the most elegant and first practical algorithms, but other algorithms now handle spam filtering.

Another example of a classification algorithm. Suppose you need to borrow some money; how does the bank know if you will repay it in the future? They can’t be sure. But the bank has many historical borrower profiles containing data such as “age,” “education level,” “occupation,” “salary,” and – most importantly – “whether they repaid.”

Using this data, we can train the machine to find patterns and draw conclusions. Finding the answer is not a problem; the issue is that banks cannot blindly trust the answers given by machines. What if the system fails, is hacked, or a drunken graduate just patched the system?

To deal with this issue, we need to use decision trees, where all data is automatically divided into “yes/no” questions – for example, “Does the borrower’s income exceed $128.12?” – it sounds a bit inhumane. However, the machine generates such questions to optimally partition the data at each step.
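To see how the machine invents a question like “Does the borrower’s income exceed $128.12?”, here is a minimal sketch: try every income value as a threshold and keep the one whose split produces the purest “yes/no” groups, measured by Gini impurity (the toy incomes and repayment labels are invented for illustration):

```python
def gini(labels):
    """Impurity of a group of 0/1 labels: 0 means perfectly pure."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(incomes, repaid):
    """Try every income value as a threshold; keep the one whose
    left/right groups have the lowest weighted Gini impurity."""
    best_threshold, best_score = None, float("inf")
    for t in sorted(set(incomes)):
        left = [r for x, r in zip(incomes, repaid) if x <= t]
        right = [r for x, r in zip(incomes, repaid) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(repaid)
        if score < best_score:
            best_threshold, best_score = t, score
    return best_threshold

# Toy borrower data: 1 = repaid, 0 = defaulted
incomes = [20_000, 35_000, 50_000, 90_000, 120_000]
repaid = [0, 0, 1, 1, 1]
print(best_split(incomes, repaid))  # 35000: cleanly separates the two groups
```

A full decision tree simply repeats this split-finding recursively on each resulting group, over every available feature, until the groups are pure enough.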


That’s how the “tree” is created. The higher the branches (closer to the root node), the broader the scope of the question. All analysts can accept this practice and provide explanations afterward, even if they don’t understand how the algorithm works; they can easily explain the results (typical analysts!).

Decision trees are widely used in high-stakes scenarios: diagnostics, medicine, and finance.

The two most well-known decision tree algorithms are CART and C4.5.

Today, pure decision tree algorithms are rarely used. However, they are the foundation of large systems, and the performance of ensemble decision trees can even surpass that of neural networks. We will discuss this later.

When you search on Google, it is a bunch of clumsy “trees” helping you find answers. Search engines love these algorithms because they run fast.

In theory, Support Vector Machine (SVM) should be the most popular classification method. Anything that exists can be classified using it: classifying plants by shape in images, classifying documents by category, etc.

The idea behind SVM is very simple – it tries to draw a line between the two classes of points while making the gap between the line and the nearest points on each side as wide as possible.
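A bare-bones sketch of that idea, training a linear SVM by sub-gradient descent on the hinge loss (toy 2-D data invented for illustration; any real use would reach for an optimized library):

```python
# Toy 2-D data: two separable classes labelled -1 / +1.
points = [(1.0, 1.0), (2.0, 1.5), (6.0, 6.5), (7.0, 6.0)]
labels = [-1, -1, 1, 1]

w = [0.0, 0.0]
b = 0.0
lr, lam = 0.01, 0.01  # learning rate and regularization strength

for _ in range(2000):
    for (x1, x2), y in zip(points, labels):
        margin = y * (w[0] * x1 + w[1] * x2 + b)
        if margin < 1:
            # Point is inside the margin: push the boundary away from it.
            w[0] += lr * (y * x1 - lam * w[0])
            w[1] += lr * (y * x2 - lam * w[1])
            b += lr * y
        else:
            # Point is fine: only shrink w, which widens the margin.
            w[0] -= lr * lam * w[0]
            w[1] -= lr * lam * w[1]

def predict(x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else -1

print(predict(1.5, 1.2), predict(6.5, 6.2))  # -1 1
```

The regularization term `lam * w` is what encodes “make the gap as wide as possible”: shrinking the weight vector is equivalent to widening the margin.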


Classification algorithms have a very useful side-application – anomaly detection. If a sample doesn’t fit any of the known categories, we flag it. This method is now used in medicine – in MRI (Magnetic Resonance Imaging), the computer marks all suspicious areas or deviations within the scan. Stock markets use it to detect traders’ anomalous behavior and find insiders. When we train a computer to recognize what is correct, we automatically teach it to recognize what is incorrect as well.

The rule of thumb indicates that the more complex the data, the more complex the algorithm. For text, numbers, and tables, I would choose classic methods to operate. These models are smaller, learn faster, and have clearer workflows. For images, videos, and other complex big data, I would definitely study neural networks.

Just five years ago, you could find face classifiers based on SVM. Now, it’s easier to choose a model from hundreds of pre-trained neural network models. However, spam filters have not changed; they are still written using SVM, and there is no reason to change that. Even my website uses SVM to filter out spam in comments.

Regression

“Draw a line through these points, hmm~ that’s machine learning.”


Regression algorithms are currently used for:

  • Stock price prediction;
  • Supply and sales volume analysis;
  • Medical diagnosis;
  • Calculating time series correlations;

Common regression algorithms include:

  • Linear Regression
  • Polynomial Regression

The “regression” algorithm is essentially also a “classification” algorithm, but it predicts a value instead of a category. For example, predicting a car’s price based on mileage, estimating traffic volume at different times of the day, and predicting the fluctuations in supply volume as the company grows. Regression algorithms are the best choice when dealing with time-related tasks.

Regression algorithms are favored by professionals in finance and analytics. They have even become a built-in function in Excel, and the whole process is very smooth – the machine simply tries to draw the line that best represents the average correlation. However, unlike a person with a pen and a whiteboard, the machine does it by minimizing the average squared distance from each point to the line.
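For a single feature, that best-fit line has a closed-form solution – ordinary least squares. A sketch on Billy-style toy data:

```python
# Toy data: car age in years vs. price in thousands of dollars.
xs = [1, 2, 3, 4, 5]
ys = [19, 18, 17, 16, 15]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least squares: slope = cov(x, y) / var(x).
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
      / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(slope, intercept)  # -1.0 20.0: the price drops $1,000 per year
```

This recovers exactly the rule Billy eyeballed at the start – which is the whole point: regression automates the pattern-spotting he did by hand.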


If the line drawn is straight, it’s “linear regression,” and if it’s curved, it’s “polynomial regression.” They are the two main types of regression. Other types are less common. Don’t be misled by the “Logistic Regression” outlier; it is a classification algorithm, not regression.

However, it’s okay to mix up “regression” and “classification.” Some classifiers can become regression models after parameter adjustments. Besides defining the category of an object, it’s also important to remember how close the object is to that category, which leads to the regression problem. If you want to delve deeper, you can read the article “Machine Learning for Humans”[1] (highly recommended).

1.2 Unsupervised Learning

Unsupervised learning appeared slightly later than supervised learning, in the 1990s. These algorithms are used less often, sometimes simply because there is no alternative.

Labeled data is a luxury. Suppose I want to create a – let’s say “bus classifier,” do I need to personally take millions of photos of those damned buses and then label each one? No way, that would take a lifetime; I have a lot of games on Steam I still haven’t played.

In such cases, we still have to have a bit of hope for capitalism; thanks to the social crowdsourcing mechanism, we can obtain millions of cheap labor and services. For example, Mechanical Turk[2], which is backed by a group of people ready to help you complete tasks for $0.05. That’s usually how things get done.

Or, you can try using unsupervised learning. However, I don’t recall any best practices regarding it. Unsupervised learning is usually used for exploratory data analysis, rather than as a primary algorithm. Those with degrees from Oxford and special training feed a bunch of garbage to the machine and start observing: Are there any clusters? No. Can we see some connections? No. Well, you still want to work in data science, right?

Clustering

“Machines will choose the best way to distinguish things based on some unknown features.”


Clustering algorithms are currently used for:

  • Market segmentation (customer types, loyalty);
  • Merging nearby points on a map;
  • Image compression;
  • Analyzing and labeling new data;
  • Detecting anomalous behavior;

Common algorithms include:

  • K-Means Clustering
  • Mean-Shift
  • DBSCAN

Clustering is performed without pre-labeled categories. It’s like categorizing socks when you can’t remember all their colors. Clustering algorithms attempt to find similar things (based on certain features) and group them into clusters. Objects with many similar features are grouped together and assigned to the same category. Some algorithms even support setting the exact number of data points in each cluster.

Here’s a good example of clustering – markers on an online map. When you search for nearby vegetarian restaurants, the clustering engine groups them and displays them using numbered bubbles. Without this, the browser would freeze – because it’s trying to plot all 300 vegetarian restaurants in that trendy city on the map.

Apple Photos and Google Photos use a more complex clustering approach. They create albums of your friends by searching for faces in photos. The application doesn’t know how many friends you have or what they look like, but it can still find common facial features. This is typical clustering.

Another common application scenario is image compression. When saving an image as PNG, you can restrict the palette to, say, 32 colors. The clustering algorithm then needs to find all the “red-ish” pixels, calculate the “average red,” and assign that average to all of them. Fewer colors mean smaller files – a good deal!

However, it gets tricky with colors like blue-green. Is this pixel green or blue? This is where the K-Means algorithm comes in.

First, randomly select 32 color points from the palette as “cluster centers,” then label the remaining points according to the nearest cluster center. This gives us “star clusters” around the 32 color points. Next, we move the cluster centers to the “centers of the stars” and repeat the above steps until the cluster centers no longer move.

Done. We have exactly 32 stable clusters.
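The steps above – pick centers, assign each point to the nearest center, move each center to its cluster’s mean, repeat until nothing moves – can be sketched in a few lines (1-D toy numbers stand in for colors; a real palette lives in 3-D color space):

```python
def kmeans(points, k, iters=20):
    """Lloyd's algorithm on 1-D points: assign each point to the nearest
    centre, move each centre to its cluster's mean, repeat until stable."""
    # Spread the initial centres evenly through the sorted data.
    pts = sorted(points)
    centres = pts[:: max(1, len(pts) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres

data = [1, 2, 3, 10, 11, 12, 20, 21, 22]
print(kmeans(data, 3))  # [2.0, 11.0, 21.0]
```

The evenly-spread initialization is a simplification for this sketch; the article’s description (random color points from the palette) works too, it just sometimes converges to a worse local optimum.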

Here’s a real-life example:


Finding cluster centers is convenient, but real-world clusters are not always circular. Suppose you are a geologist and need to find some similar ores on a map. In this case, the shapes of the clusters can be very strange, even nested. You may not even know how many clusters there are – 10? 100?

K-means algorithm won’t work here, but DBSCAN algorithm will. We treat data points as people in a square, asking any three people standing close to hold hands. Next, we tell them to grab the hands of any neighbors they can reach (the whole process must keep the people’s standing positions unchanged), and repeat this step until new neighbors join. This gives us the first cluster, and we repeat the process until everyone is assigned to a cluster. Done. An unexpected gain: a person with no one holding their hands – an outlier.
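The hand-holding procedure above is a reasonable mental model of DBSCAN. A 1-D sketch (here `eps` is the arm’s reach, `min_pts` how many people must stand together to start a cluster; loners end up labelled -1):

```python
def dbscan(points, eps=1.5, min_pts=2):
    """Grow clusters by repeatedly absorbing reachable neighbours.
    Points with too few neighbours within eps come out as -1: outliers."""
    labels = {p: None for p in points}
    cluster = 0
    for p in points:
        if labels[p] is not None:
            continue
        neighbours = [q for q in points if abs(q - p) <= eps]
        if len(neighbours) < min_pts:
            labels[p] = -1  # nobody close enough to hold hands with
            continue
        labels[p] = cluster
        frontier = neighbours
        while frontier:  # keep grabbing the neighbours' neighbours
            q = frontier.pop()
            if labels[q] in (None, -1):
                labels[q] = cluster
                more = [r for r in points if abs(r - q) <= eps]
                if len(more) >= min_pts:
                    frontier += [r for r in more if labels[r] is None]
        cluster += 1
    return labels

data = [1, 2, 3, 10, 11, 50]
print(dbscan(data))  # {1: 0, 2: 0, 3: 0, 10: 1, 11: 1, 50: -1}
```

Note that the number of clusters is never specified – it emerges from the data, which is exactly why DBSCAN handles the geologist’s strangely-shaped ore deposits where K-means fails.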

The whole process looks cool.


Interested in learning more about clustering algorithms? You can read the article “5 Clustering Algorithms Every Data Scientist Should Know”[3].

Like classification algorithms, clustering can also be used for anomaly detection. Is there abnormal behavior after a user logs in? Let the machine temporarily disable their account and create a ticket for technical support to check what’s going on. Perhaps the other party is a “bot.” We don’t even need to know what “normal behavior” looks like; we just need to feed the user behavior data to the model and let the machine decide if the other party is a “typical” user.

This method is not as effective as classification algorithms but is still worth a try.

Dimensionality Reduction

“Assembling specific features into higher-level features.”


“Dimensionality Reduction” algorithms are currently used for:

  • Recommendation systems;
  • Beautiful visualizations;
  • Topic modeling and finding similar documents;
  • Fake image detection;
  • Risk management;

Common “dimensionality reduction” algorithms include:

  • Principal Component Analysis (PCA)
  • Singular Value Decomposition (SVD)
  • Latent Dirichlet Allocation (LDA)
  • Latent Semantic Analysis (LSA, pLSA, GLSA)
  • t-SNE (for visualization)

In the past, “hardcore” data scientists used these methods, determined to discover “interesting things” in a pile of numbers. When Excel charts didn’t work, they forced the machines to do the pattern-finding work. Thus, they invented dimensionality reduction or feature learning methods.


Projecting 2D data onto a line (PCA)

For people, abstract concepts are more convenient compared to a pile of fragmented features. For example, we can combine a dog with triangular ears, a long nose, and a big tail into the abstract concept of a “shepherd dog.” Compared to specific shepherd dogs, we do lose some information, but the new abstract concept is more useful in scenarios where naming and explanation are needed. As a reward, these “abstract” models learn faster, require fewer features during training, and reduce overfitting.
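PCA makes this concrete: it finds the single direction that preserves the most variance, and power iteration on the covariance matrix is enough to locate it. A sketch on small toy 2-D data:

```python
# Toy 2-D data with two correlated features.
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
centred = [(x - mx, y - my) for x, y in data]

# Entries of the 2x2 covariance matrix.
cxx = sum(x * x for x, _ in centred) / n
cyy = sum(y * y for _, y in centred) / n
cxy = sum(x * y for x, y in centred) / n

# Power iteration: repeatedly multiply a vector by the covariance matrix
# and renormalize; it converges to the top eigenvector (the principal axis).
v = (1.0, 0.0)
for _ in range(100):
    vx = cxx * v[0] + cxy * v[1]
    vy = cxy * v[0] + cyy * v[1]
    norm = (vx * vx + vy * vy) ** 0.5
    v = (vx / norm, vy / norm)

print(v)  # roughly (0.68, 0.74): the single line both features project onto
```

Projecting every point onto `v` replaces two correlated features with one abstract feature, at the cost of a little information – the trade described in the paragraph above.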

These algorithms shine in tasks like “topic modeling.” We can abstract meanings from specific phrases. Latent Semantic Analysis (LSA) does this based on how frequently particular words appear in particular topics. For instance, technology articles contain more technology-related vocabulary, politicians’ names appear mostly in politics-related news, and so on.

We can create clusters from all the words in all articles, but doing so would lose all important connections (for example, in different articles, battery and accumulator mean the same thing), and LSA can handle this problem well, which is why it’s called “latent semantics.”

Therefore, we need to connect words and documents into a feature to maintain their latent connections – people found that Singular Value Decomposition (SVD) can solve this problem. Those useful topic clusters are easily seen from grouped phrases.


Recommendation systems and collaborative filtering are another frequent application of dimensionality-reduction algorithms. If you use them to extract information from user ratings, you get a great system to recommend movies, music, games, or anything you want.

Here’s a book I love, “Programming Collective Intelligence,” which was my bedside book during college.

It’s nearly impossible to fully understand these abstractions on machines, but you can observe some correlations: some abstract concepts are related to user age – children play “Minecraft” or watch cartoons more, while others may relate to movie styles or user preferences.

Just based on user ratings, the machine can identify these high-level concepts without needing to understand them. Well done, Mr. Computer. Now we can write a paper on “Why Do Bearded Lumberjacks Like My Little Pony.”

Association Rule Learning

“Finding patterns in order flows.”


“Association rules” are currently used for:

  • Predicting sales and discounts;
  • Analyzing products bought “together”;
  • Planning product displays;
  • Analyzing web browsing patterns;

Common algorithms include:

  • Apriori
  • Eclat
  • FP-growth

Algorithms used to analyze shopping carts, automate marketing strategies, and other event-related tasks are all here. If you want to discover some patterns from a sequence of items, give them a try.

For example, a customer takes a six-pack of beer to the checkout. Should we place peanuts along the way? How often do people buy beer and peanuts together? Yes, association rules are likely applicable to the case of beer and peanuts, but can we predict what other sequences? Can we make small changes in product layout that lead to significant profit increases?
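The beer-and-peanuts question is just counting: how often does the pair appear together (its “support”), and how often do beer buyers also take peanuts (the “confidence” of the rule beer → peanuts)? A sketch with invented baskets:

```python
from itertools import combinations
from collections import Counter

# Invented transaction data: each basket is one checkout.
baskets = [
    {"beer", "peanuts", "diapers"},
    {"beer", "peanuts"},
    {"beer", "bread"},
    {"milk", "bread"},
]

# Support: the fraction of baskets containing each pair.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1
support = {p: c / len(baskets) for p, c in pair_counts.items()}

# Confidence of "beer -> peanuts": of the beer baskets, how many had peanuts?
beer_baskets = [b for b in baskets if "beer" in b]
confidence = sum("peanuts" in b for b in beer_baskets) / len(beer_baskets)

print(support[("beer", "peanuts")])  # 0.5
print(round(confidence, 2))          # 0.67
```

Apriori’s contribution is pruning: it avoids counting every possible pair and triple by noting that a combination can only be frequent if all its sub-combinations are frequent.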

This idea also applies to e-commerce, where the tasks are more interesting – What will the customer buy next?

For some reason, rule learning seems to be rarely discussed within machine learning. The classic approach is a brute-force search over all the purchased items, using trees or sets. The algorithms can only search for patterns but cannot generalize or reproduce those patterns on new examples.

In the real world, every large retailer has built its own proprietary solution, so there won’t be a revolution here for you. The most advanced technology mentioned in this article is recommendation systems – though I may simply be unaware of a breakthrough in this area. If you have something to share, please let me know in the comments.

Editor / Zhang Zhihong

Reviewer / Fan Ruiqiang

Revised / Zhang Zhihong

Source: Mathematics China
