Differences Between Statistics and Machine Learning

Source: Not Just Data Analysis

This article is about 5,800 words long; reading it will take roughly 10 minutes.
Without statistics, machine learning could not exist; but thanks to the contemporary information explosion and the vast amount of data humans can now access, machine learning is extraordinarily useful.

The distinction between statistics and machine learning has always been vague. In both industry and academia it has long been held that machine learning is merely a shiny coat of paint on statistics, and artificial intelligence, which is powered by machine learning, is likewise called "an extension of statistics." Nobel laureate Thomas Sargent, for example, once said that artificial intelligence is really just statistics dressed up in elegant language, and he repeated at the World Innovation Forum that artificial intelligence is indeed just statistics.

Of course there are dissenting views, but the arguments on both sides tend to be full of seemingly profound yet actually vague claims that are hard to pin down. A Harvard PhD student named Matthew Stewart has argued from two angles, the difference between statistics and machine learning and the difference between statistical models and machine learning, that machine learning and statistics are not interchangeable terms.

Contrary to what most people think, machine learning has existed for decades. It was initially set aside because the computing power of the era could not meet its demand for large-scale computation; in recent years, thanks to the data and compute advantages brought by the information explosion, machine learning has been reviving rapidly.

Back to the point: if machine learning and statistics were interchangeable terms, why haven't we seen every university's statistics department shut down and rebrand as a "machine learning" department? Because they are different!

I often hear vague discussions of this topic, the most common being statements like:

"The main difference between machine learning and statistics lies in their objectives. Machine learning models aim to make the most accurate predictions possible. Statistical models are designed to infer the relationships between variables."

While technically correct, this does not give a particularly clear or satisfying answer. One major difference between machine learning and statistics is indeed their objectives. But saying that machine learning is about accurate prediction while statistical models are designed for inference is almost a meaningless statement unless you are already well versed in these concepts.

First, we must understand that statistics and statistical modeling are not the same thing. Statistics is the mathematical study of data; without data, there is no statistics. A statistical model is a model of the data, used mainly either to infer relationships within the data or to create a model that can predict future values. Typically the two go hand in hand.

So we actually need to discuss two questions: first, how statistics differs from machine learning; and second, how statistical models differ from machine learning.

To put it bluntly, many statistical models can make predictions, but their predictive performance is often unsatisfactory. Machine learning, meanwhile, typically sacrifices interpretability for predictive power.
For instance, as we go from linear regression to neural networks, interpretability falls while predictive power rises significantly. From a macro perspective this is a good answer, at least good enough for most people. In some cases, though, it can mislead us about the difference between machine learning and statistical modeling. Let's look at the example of linear regression.

Differences Between Statistical Models and Machine Learning in Linear Regression

Perhaps because the methods used in statistical modeling and machine learning are so similar, people believe they are the same thing. I can understand that, but it is simply not the case.

The most obvious example is linear regression, which is probably the main source of the misunderstanding. Linear regression is a statistical method: we can train a linear regression model with it, and we can also fit a statistical regression model by least squares.

In the first case, "training" the model uses only a subset of the data, and the trained model's performance must be tested on a separate subset called the test set. Here the ultimate goal of machine learning is to achieve the best performance on the test set.

In the second case, we assume in advance that the data follow a linear relationship with Gaussian noise, and then try to find the line that minimizes the mean squared error over all of the data. No training or test set is needed. In many cases, especially in research (such as the sensor example below), the point of the model is to describe the relationship between the data and the output variable, not to predict future data. We call this procedure statistical inference rather than prediction. We can still use the model to predict, and that may be what you have in mind, but the model is evaluated not on a test set, rather on the significance and robustness of its parameters. The sketch below contrasts the two workflows.
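Here is a minimal sketch of the two workflows on simulated data. The library choices (scikit-learn for the train/test workflow, statsmodels for the inferential fit) and the dataset are illustrative assumptions on my part, not code from the original article; the point is only what each workflow optimizes and reports.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import statsmodels.api as sm

# Simulated data: a linear signal with Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.5 * X[:, 0] + 1.0 + rng.normal(0, 2.0, size=200)

# Machine learning workflow: train on one subset, judge by test-set error.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
ml_model = LinearRegression().fit(X_train, y_train)
print("test MSE:", mean_squared_error(y_test, ml_model.predict(X_test)))

# Statistical workflow: fit on all the data, judge by parameter significance.
stat_model = sm.OLS(y, sm.add_constant(X)).fit()
print(stat_model.summary())  # coefficients, p-values, confidence intervals
```

The machine learning workflow reports a held-out prediction error; the statistical workflow reports coefficient estimates with significance tests and confidence intervals.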
The purpose of machine learning (specifically supervised learning here) is to obtain a model that can make repeatable predictions. We usually do not care whether the model is interpretable; machine learning cares only about results, much like a company that values you solely on your performance. Statistical modeling, by contrast, is about finding relationships between variables and determining the significance of those relationships, although it happens to lend itself to prediction as well.

Let me give an example to illustrate the difference. I am an environmental scientist, and my main job involves working with sensor data. If I am trying to prove that a sensor can respond to a certain stimulus (such as a gas concentration), I will use a statistical model to determine whether the signal response is statistically significant. I will try to understand the relationship and test its reproducibility so that I can accurately describe the sensor's response and make inferences from the data. I might also test whether the response is linear, whether it is attributable to the gas concentration rather than to random noise in the sensor, and so on.

At the same time, I can take data from 20 different sensors and try to predict the response of another sensor from them. If you are not very familiar with sensors this may seem a bit strange, but it is an important research area in environmental science. A model that characterizes the output of a sensor with 20 different variables is clearly about prediction, and I do not expect it to be interpretable. Non-linearities arising from chemical kinetics, along with the relationships between physical variables and gas concentration, could make this model as obscure as a hard-to-interpret neural network. I would like the model to be understandable, but as long as it predicts accurately I will be quite happy.

If I am trying to prove that a relationship between data variables is statistically significant so that I can publish the result in a scientific paper, I will use a statistical model rather than machine learning. This is because I care more about the relationships between variables than about making predictions. Prediction may still matter, but most machine learning algorithms lack interpretability, which makes it hard to demonstrate the existence of relationships in the data.

Clearly, the two approaches differ in their objectives even though they use similar means to get there. A machine learning algorithm is evaluated using a test set that verifies its accuracy, whereas for a statistical model, analysis of the regression parameters via confidence intervals, significance tests, and other checks is used to assess the model's validity. Because these methods often yield the same results, it is easy to see why people might assume they are the same.

However, it is unreasonable to conflate the two terms simply because they rely on the same fundamental concepts of probability. It's like saying:

  • Physics is just a more pleasant way of saying mathematics.
  • Zoology is just a more pleasant way of saying stamp collecting.
  • Architecture is just a more pleasant way of saying sandcastle building.

These statements (especially the last one) are absurd; they conflate two ideas that merely build on similar foundations. In reality, physics rests on a mathematical foundation, and understanding physical phenomena is an application of mathematics. Physics also includes many aspects of statistics, while modern statistics is typically built on a framework combining Zermelo-Fraenkel set theory with measure theory to produce probability spaces. The two have much in common because they come from similar origins and apply similar ideas to reach logical conclusions. Likewise, architecture and sandcastle building may share a great deal, but even though I am not an architect, I can see that they are clearly not the same.

Before going further, it is worth briefly clearing up two other common misconceptions related to machine learning and statistics: artificial intelligence is not the same as machine learning, and data science is not the same as statistics. These points are fairly uncontroversial, so they can be dealt with quickly.

Data science is essentially the application of computational and statistical methods to data, whether small or large datasets. It also includes things like exploratory data analysis, in which the data are examined and visualized to help scientists understand them better and draw inferences from them. Data science further includes data wrangling and preprocessing, and therefore involves a degree of computer science, since it requires coding and building connections and pipelines between databases, web servers, and so on.

To do statistics you do not necessarily need a computer, but data science cannot operate without one. This again shows that while data science relies on statistics, the two are not the same concept.

Similarly, machine learning is not artificial intelligence; rather, machine learning is a branch of artificial intelligence. This is fairly obvious, given that we "teach" (train) machines to make generalizable predictions about particular kinds of data based on past data.

Machine Learning is Based on Statistics

Before discussing the differences between statistics and machine learning, let's talk about their similarities, which the first half of this article has already touched on.

Machine learning is built on a statistical framework. This should be evident, since machine learning involves data, and data must be described within a statistical framework. However, statistical mechanics, which extends thermodynamics to huge numbers of particles, is also built on a statistical framework.

The concept of pressure is really a statistic, and so is temperature. This may sound unreasonable, but it is true: that is why you cannot speak of the temperature or pressure of a single molecule; it would be meaningless. Temperature is a manifestation of the average energy produced by molecular collisions, and for something containing an enormous number of molecules, such as a house or the outdoors, it is reasonable to describe it with a temperature.

Would you conclude that thermodynamics and statistics are one and the same? Of course not: thermodynamics uses statistics to help us understand the interaction of motion and heat in transport phenomena. The tiny sketch below makes the point about temperature concrete.
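As a toy illustration (simulated data; the exponential energy distribution is just an illustrative choice, not a physical derivation), "temperature" here stands in for an average over a population of molecules, a statistic that has no meaning for a single molecule:

```python
import numpy as np

rng = np.random.default_rng(3)

# Kinetic energies of a large number of gas molecules (arbitrary units).
energies = rng.exponential(scale=1.0, size=1_000_000)

# "Temperature" as a statistic: the average molecular kinetic energy,
# which only makes sense for the whole population.
temperature = energies.mean()
print(f"temperature of the gas: {temperature:.4f}")

# For a single molecule, the same quantity is just that molecule's energy;
# calling it a temperature would be meaningless.
print(f"'temperature' of one molecule: {energies[0]:.4f}")
```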
In fact, thermodynamics is built on multiple disciplines, not on statistics alone. Similarly, machine learning draws on content from many other fields, such as mathematics and computer science. For example:

  • The theory of machine learning stems from mathematics and statistics;
  • Machine learning algorithms are based on optimization theory, matrix algebra, and calculus;
  • The implementation of machine learning comes from concepts in computer science and engineering, such as kernel mapping, feature hashing, etc.

When a person programming in Python suddenly pulls these algorithms out of the sklearn library and uses them, many of the concepts above remain quite abstract, making it hard to see the differences among them. In that situation, the abstraction also breeds a degree of ignorance about what machine learning actually encompasses.

Statistical Learning Theory – The Statistical Foundation of Machine Learning

The primary difference between statistics and machine learning is that statistics is based entirely on probability spaces. You can derive all of statistics from set theory, which discusses how we can group outcomes into collections called "sets" and then impose a measure on these sets so that the total over all of them sums to 1. We call this construction a probability space.

Beyond defining these sets and measures, statistics makes no other assumptions. This is why the definition of a probability space is so rigorous. A probability space, written mathematically as (Ω, F, P), consists of three parts (a toy example in code follows the list):

  • A sample space, Ω, which is the set of all possible outcomes.
  • A collection of events, F, where each event is a set containing zero or more outcomes.
  • A probability assigned to each event, P, which is a function from events to probabilities.
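Here is a toy probability space for a single fair coin flip, written out in plain Python. This is an illustrative sketch of the definition above (the names omega, F, and P are my own), not library code:

```python
from fractions import Fraction
from itertools import chain, combinations

# A toy probability space (Ω, F, P) for one fair coin flip.
omega = frozenset({"heads", "tails"})  # sample space Ω

def powerset(s):
    """All subsets of s: here, the event collection F (a sigma-algebra)."""
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

F = powerset(omega)  # events: {}, {heads}, {tails}, {heads, tails}

def P(event):
    """Uniform measure: each outcome carries probability 1/|Ω|."""
    return Fraction(len(event), len(omega))

assert P(omega) == 1           # the whole sample space has measure 1
assert P(frozenset()) == 0     # the impossible event has measure 0
print({tuple(sorted(e)): P(e) for e in F})
```

The assertion on the whole sample space checks exactly the property the definition demands: the measure over everything sums to 1.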

Machine learning is based on statistical learning theory, which itself still rests on the axiomatic language of probability spaces. The theory, which builds on traditional statistics, was developed in the 1960s.

Machine learning comes in several varieties; in this article I will focus only on supervised learning theory, because it is the easiest to explain (although it remains somewhat obscure, buried as it is in mathematics).

In statistical learning theory for supervised learning, we are given a dataset S = {(xᵢ, yᵢ)}: that is, N data points, each described by attributes called "features", denoted x, which some function maps to a desired value y.

Given this dataset, the question is how to find the function that maps the x values to the y values. The set of all possible functions that could describe this mapping is called the hypothesis space.

To find this function, we must give the algorithm some way to "learn" what best tackles the problem. That is provided by the loss function. For each hypothesis (that is, each proposed function), we measure how well it performs by its expected risk over all the data.

The expected risk is essentially the loss function averaged over the probability distribution of the data: R(f) = E[L(y, f(x))], the integral of the loss against the joint distribution P(x, y). If we knew this joint distribution, finding the optimal function would be straightforward. But the distribution is usually unknown, so the best we can do is propose a function and empirically check whether the loss has been minimized. The loss averaged over the sample itself, R̂(f) = (1/N) Σᵢ L(yᵢ, f(xᵢ)), is called the empirical risk.

We can then compare different functions and look for the hypothesis that minimizes the expected risk, that is, the hypothesis attaining the infimum of the risk over all functions.

However, an algorithm that merely drives the loss down tends to cheat by overfitting. This is why we "learn" the function on a training set and then validate it on a test set, data held out from training.

This way of defining the essence of machine learning is what gives rise to the overfitting problem and what justifies the distinction between training and test sets. In statistics we do not try to minimize an empirical risk, so overfitting is not an inherent feature of statistics. A learning algorithm that picks the function minimizing the empirical risk is performing what is called empirical risk minimization.

Illustration

Take linear regression as a simple example. In the traditional view, we try to minimize the error between the data and a function describing them, typically using the mean squared error (squaring the errors prevents positive and negative deviations from cancelling). We can then solve for the regression coefficients with a closed-form expression.

If, instead, we take the loss function to be the mean squared error and carry out empirical risk minimization as statistical learning theory prescribes, we happen to arrive at the same result as traditional linear regression analysis.

This coincidence arises because the two situations are identical: solving for the maximum-likelihood answer in the same way on the same data will naturally yield the same result. Maximum likelihood can be pursued by different routes to the same goal, but nobody would argue that maximum likelihood is the same thing as linear regression. The sketch below makes the coincidence concrete.
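A small sketch (simulated data, NumPy only; the learning rate and iteration count are illustrative choices of mine): minimizing the mean-squared-error empirical risk by gradient descent recovers essentially the same coefficients as the closed-form least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.uniform(-3, 3, 100)])  # intercept + one feature
y = X @ np.array([0.5, 2.0]) + rng.normal(0, 0.5, 100)        # true weights (0.5, 2.0)

# Traditional statistics: closed-form least-squares solution (normal equations).
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Statistical learning theory: empirical risk minimization with the MSE loss,
# minimized iteratively by gradient descent.
w = np.zeros(2)
lr = 0.01                                    # learning rate (illustrative)
for _ in range(5000):
    grad = 2 / len(y) * X.T @ (X @ w - y)    # gradient of (1/N)·Σ(Xw − y)²
    w -= lr * grad

print(w_closed, w)   # the two estimates agree to high precision
```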
This simplest of examples obviously does not distinguish the two approaches, so a few further points are worth noting.

First, traditional statistical methods have no notion of training and test sets, but we do use different metrics to help validate the model. Although the validation procedures differ, both approaches can deliver statistically robust results.

Second, traditional statistical methods hand us the optimal solution in closed form; they do not try out other candidate functions and converge toward an answer. Machine learning methods, by contrast, typically try a batch of different models and converge iteratively on a final hypothesis.

If we had used a different loss function, the results would not have coincided. For example, with the hinge loss (which is not differentiable, so standard gradient descent cannot be applied directly and methods such as subgradient descent are needed instead), the results would differ.

Finally, we can distinguish the approaches by model bias. With machine learning you can test linear models, polynomial models, exponential models, and so on, to check which hypothesis fits the dataset better with respect to your chosen loss function. In the traditional statistical view, we select one model and assess its accuracy, but we cannot automatically pick the best out of 100 different models. Obviously, there will always be some bias in the model found, because an algorithm has to be chosen at the outset; choosing one is unavoidable, since finding the optimal function for a dataset is an NP-hard problem. A minimal sketch of this kind of automated model comparison follows.
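Here is a minimal sketch of automated model comparison via cross-validation (simulated non-linear data; scikit-learn and the candidate polynomial degrees are illustrative choices): the machine learning machinery scores a family of hypotheses and lets the lowest error win.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(150, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(0, 0.2, 150)   # a non-linear relationship

# Candidate hypotheses: polynomials of increasing degree.
for degree in (1, 2, 3, 5, 8):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    score = cross_val_score(model, X, y, cv=5,
                            scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: CV MSE = {-score:.3f}")
# The degree with the lowest cross-validated error is selected automatically.
```

A traditional analysis would instead commit to one model form up front and then assess the significance of its parameters.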
So which method is superior?

This question is actually rather foolish. Without statistics, machine learning could not exist; but given the vast amount of data humanity can now access thanks to the information explosion, machine learning is extremely useful.

Deciding between machine learning and statistical models is harder still: which one to choose depends on your goal. If you just want an algorithm that predicts housing prices with high accuracy, or that determines which kinds of people are more likely to contract a certain disease, machine learning is probably the better choice. If you want to identify relationships between variables or draw inferences from data, a statistical model is the better choice.

Text in the image:
"Is this your machine learning system?"
"Yes, you just dump all the data into this big pile of linear algebra and collect the answers at the other end."
"What if the answers are wrong?"
"Then just stir the pile until they look right."

If your foundation in statistics is not solid, you can still learn and use machine learning: the abstractions in machine learning libraries make it easy for an amateur to work with them. But you still need to understand the underlying statistical concepts, to avoid overfitting your models or drawing inferences that merely seem plausible.

Related reports:
https://towardsdatascience.com/the-actual-difference-between-statistics-and-machine-learning-64b49f07ea3?gi=412e8f93e22e

Editor: Huang Jiyan
