12 Essential Machine Learning Model Evaluation Metrics

The idea of building a machine learning or deep learning model follows the principle of constructive feedback. You build a model, get feedback from metrics, improve it, and continue until you achieve the desired accuracy. Evaluation metrics explain the performance of a model. An important aspect of evaluation metrics is their ability to discriminate between the results of different models.

This article explains 12 important evaluation metrics that data science professionals must understand. You will learn about their uses, advantages, and disadvantages, which will help you select and implement them accordingly.




1. Background Knowledge


Evaluation metrics are quantitative measures used to assess the performance and effectiveness of statistical or machine learning models. These metrics provide insights into how well the model is performing and help compare different models or algorithms.

When evaluating machine learning models, it is crucial to assess their predictive capability, generalization ability, and overall quality. Evaluation metrics provide an objective standard for measuring these aspects. The choice of evaluation metrics depends on the specific problem domain, data type, and expected outcomes.

I have seen many analysts and aspiring data scientists who do not even bother to check how robust their models are. Once they finish building a model, they hurriedly apply it to unseen data. This is an incorrect approach. Building a predictive model is not the end goal in itself; the goal is to create and select a model that gives a high accuracy score on out-of-sample data. Therefore, it is crucial to check the accuracy of the model before computing predictions.

In our industry, we consider different types of metrics to evaluate our machine learning models. The choice of evaluation metrics entirely depends on the type of model and the implementation plan of the model. After completing the model building, these 12 metrics will help you assess the accuracy of the model. Considering the growing popularity and importance of cross-validation, I have also mentioned its principles in this article.

1.1 Types of Predictive Models

When we talk about predictive models, we refer to regression models (continuous output) or classification models (nominal or binary output). The evaluation metrics used in each model differ.

In classification problems, we use two types of algorithms (depending on the type of output they create):

  • Class Output: Algorithms like SVM and KNN create class outputs. For example, in a binary classification problem, the output will be 0 or 1. Today, we have algorithms that can convert these class outputs into probabilities. However, these algorithms have not been well accepted in the statistical community.

  • Probability Output: Algorithms like logistic regression, random forests, gradient boosting, and Adaboost provide probability outputs. Converting probability outputs into class outputs is merely a matter of creating threshold probabilities.

In regression problems, we do not encounter this inconsistency. The output is essentially always continuous and does not require further processing.

For the discussion of classification model evaluation metrics, I used my predictions on the BCI challenge problem on Kaggle. The solution to the problem is outside the scope of our discussion here. However, this article uses the final predictions from the training set. The predictions for the problem are probability outputs, which are converted to class outputs with a threshold of 0.5.


2. Common Evaluation Metrics


Now we introduce commonly used evaluation metrics in machine learning.
2.1 Confusion Matrix
A confusion matrix is an N X N matrix, where N is the number of classes being predicted. For the current problem, we have N=2, so we get a 2 X 2 matrix. It is a performance measure for machine learning classification problems where the output can be two or more categories. The confusion matrix is a table that contains the 4 different combinations of predicted and actual values. It is very useful for measuring precision, recall, specificity, accuracy, and most importantly, the AUC-ROC curve.
Here are some definitions to remember about the confusion matrix:
  • True Positive: Predicted as positive, and it is true.

  • True Negative: Predicted as negative, and it is true.

  • False Positive: Type 1 error, predicted as positive, but the result is false.

  • False Negative: Type 2 error, predicted as negative, but the result is false.

  • Accuracy: The proportion of correct predictions out of the total number of predictions.

  • Positive Predictive Value or Precision: The proportion of samples predicted as positive that are actually positive.

  • Negative Predictive Value: The proportion of samples predicted as negative that are actually negative.

  • Sensitivity or Recall: The proportion of actual positive samples that are correctly identified.

  • Specificity: The proportion of actual negative samples that are correctly identified.

  • Rate: It is a measurement factor in the confusion matrix. It also has four types: TPR, FPR, TNR, and FNR.

(Table: confusion matrix and derived metrics for the problem at hand)
The accuracy for the problem at hand is 88%. From the two tables above, we can see that the positive predictive value is high, but the negative predictive value is quite low, and the same holds for sensitivity and specificity. This is mainly driven by the threshold we have chosen. If we lower the threshold, these two pairs of very different numbers will move closer together.
Generally, we care most about one of the metrics defined above. For example, a pharmaceutical company would be more concerned about minimizing false positive diagnoses, and hence about high specificity. An attrition (churn) model, on the other hand, will focus more on sensitivity. The confusion matrix is typically used only with class output models.
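To make these definitions concrete, here is a minimal Python sketch (using scikit-learn on made-up labels, not the BCI predictions from the article) that derives the quantities above from a 2 x 2 confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical binary labels: 1 = positive class, 0 = negative class
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 0, 1])

# For binary problems, ravel() returns the four cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)        # positive predictive value
npv         = tn / (tn + fn)        # negative predictive value
sensitivity = tp / (tp + fn)        # recall / true positive rate
specificity = tn / (tn + fp)        # true negative rate

print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  NPV={npv:.2f}  "
      f"sensitivity={sensitivity:.2f}  specificity={specificity:.2f}")
```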
2.2 F1 Score
In the previous section, we discussed precision and recall for classification problems and emphasized the importance of choosing the right precision/recall trade-off for our use case. What if we try to achieve the best precision and recall at the same time? The F1-Score is the harmonic mean of precision and recall for classification problems. The formula for the F1-Score is as follows:
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
Now, one obvious question that comes to mind is why use the harmonic mean instead of the arithmetic mean. This is because the harmonic mean penalizes extreme values more. Let’s understand this with an example. We have a binary classification model with the following results:
Precision: 0, Recall: 1
Here, if we take the arithmetic mean, we would get 0.5. It is apparent that the above result comes from a foolish classifier that ignores input and predicts one class as output. Now, if we take the harmonic mean, we would get 0, which is accurate because the model is useless for all intents and purposes.
This seems straightforward. However, in some cases, data scientists may want to give more importance/weight to precision or recall. By slightly altering the expression above to include an adjustable parameter beta for this purpose, we get:
Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
Fbeta measures the effectiveness of the model for the user, where the user values recall β times more than precision.
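As a quick illustration (again on made-up labels rather than the article's data), scikit-learn exposes both scores directly; beta > 1 weights recall more heavily, while beta < 1 favours precision:

```python
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]

print("F1  :", f1_score(y_true, y_pred))
print("F2  :", fbeta_score(y_true, y_pred, beta=2))    # recall-oriented
print("F0.5:", fbeta_score(y_true, y_pred, beta=0.5))  # precision-oriented
```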
2.3 Gain and Lift Charts
Gain and lift charts are mainly concerned with checking the rank ordering of the probabilities. Here are the steps to construct a lift/gain chart:
  • Step 1: Calculate the probability for each observation
  • Step 2: Sort these probabilities in descending order.
  • Step 3: Construct deciles, each containing nearly 10% of the observations.
  • Step 4: Calculate the response rate for good (responders), bad (non-responders), and total for each decile.
You will get the following table to draw the gain/lift chart:
(Table: decile-wise counts of responders and non-responders with response rates)
This is a very informative table. The cumulative gain chart is the graph between cumulative %Right and cumulative %Population. For the current case, this is the chart:
(Figure: cumulative gain chart — cumulative %Right vs. cumulative %Population)
This chart tells you how well the model distinguishes responders from non-responders. For example, the first decile has 10% of the population and 14% of responders. This means we achieved a 140% lift at the first decile.
What is the maximum lift we can achieve at the first decile? From the first table in this article, we know that the total number of responders is 3850. Additionally, the first decile will contain 543 observations. Therefore, the maximum lift for the first decile could be 543/3850 ~ 14.1%. Thus, our model is very close to perfect.
Now let’s plot the lift curve. The lift curve is the graph between total lift and percentage of population. Note that for a random model, this value remains constant at 100%. This is the plot for the current case:
(Figure: lift chart — total lift vs. % population)
What does this chart tell you? It indicates that our model does well up to the seventh decile; after that, every decile will be skewed towards non-responders. Any model with lift above 100% at least until the third decile (and up to the seventh decile) is a good model. Otherwise, you might consider oversampling first.
Lift/gain charts are widely used in marketing campaign targeting problems. They tell us which decile the target customers of a specific marketing campaign can be located in. Additionally, it tells you how much response you can expect from the new target group.
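For readers who want to reproduce such a table, here is a rough pandas sketch on simulated probabilities; the column names and the data are my own illustration, not the article's:

```python
import numpy as np
import pandas as pd

# Simulated probabilities and responses (illustrative only)
rng = np.random.default_rng(0)
prob = rng.random(1000)
actual = (rng.random(1000) < prob).astype(int)          # responders correlate with the score

df = pd.DataFrame({"prob": prob, "actual": actual})
df = df.sort_values("prob", ascending=False).reset_index(drop=True)
df["decile"] = pd.qcut(np.arange(len(df)), 10, labels=False) + 1   # decile 1 = highest scores

table = df.groupby("decile")["actual"].agg(total="count", responders="sum")
table["cum_gain_pct"] = 100 * table["responders"].cumsum() / table["responders"].sum()
table["cum_pop_pct"] = 100 * table["total"].cumsum() / table["total"].sum()
table["lift_pct"] = 100 * table["cum_gain_pct"] / table["cum_pop_pct"]   # 100% = random model
print(table)
```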
2.4 K-S Chart
K-S or Kolmogorov-Smirnov chart measures the performance of a classification model. More accurately, K-S is a measure of the separation between positive and negative distributions. If the scores separate the overall population into two distinct groups, one containing all positive samples and the other containing all negative samples, then K-S is 100.
On the other hand, if the model cannot distinguish between positive and negative, it is as if the model randomly selects cases from the overall population. K-S would be 0. In most classification models, K-S will fall between 0 and 100, with higher values indicating better ability of the model to distinguish positive and negative cases.
For the case at hand, here is the table:
(Table: decile-wise cumulative distributions of responders and non-responders with the K-S statistic)
We can also plot %Cumulative Good and Bad to see the maximum separation. Here is an example chart:
(Figure: %Cumulative Good vs. %Cumulative Bad, showing the maximum separation)
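A minimal NumPy sketch of the K-S computation, again on made-up scores rather than the article's data:

```python
import numpy as np

def ks_statistic(y_true, y_prob):
    """Maximum separation between the cumulative distributions of positives and negatives."""
    order = np.argsort(-np.asarray(y_prob))            # highest scores first
    y = np.asarray(y_true)[order]
    cum_pos = np.cumsum(y) / y.sum()                   # cumulative share of positives
    cum_neg = np.cumsum(1 - y) / (1 - y).sum()         # cumulative share of negatives
    return 100 * np.max(np.abs(cum_pos - cum_neg))     # 0 = random, 100 = perfect separation

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_prob = [0.9, 0.4, 0.8, 0.6, 0.3, 0.55, 0.7, 0.2, 0.1, 0.65]
print(f"K-S = {ks_statistic(y_true, y_prob):.1f}")
```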
The evaluation metrics introduced here are primarily used for classification problems. So far, we have learned about confusion matrices, lift and gain charts, and K-S charts. Let’s continue to learn some more important metrics.
2.5 Area Under the ROC Curve (AUC – ROC)
This is another popular evaluation metric in the industry. The main advantage of using the ROC curve is that it is independent of changes in the responder proportion. This statement will become clearer in the following sections.
First, let's try to understand what the ROC (Receiver Operating Characteristic) curve is. If we look at the confusion matrix below, we notice that for a probabilistic model, each choice of threshold yields different values for these metrics.
(Figure: confusion matrix for a probabilistic model)
Thus, for each sensitivity, we will get different specificity. The distinction between the two is as follows:
Sensitivity = TP / (TP + FN)    and    Specificity = TN / (TN + FP)
The ROC curve is a graph between sensitivity and (1-specificity). (1-specificity) is also known as the false positive rate, and sensitivity is also known as the true positive rate. Here is the ROC curve for the current case.
(Figure: ROC curve for the case at hand)
Let's take threshold = 0.5 as an example. This is the corresponding confusion matrix:
(Table: confusion matrix at threshold = 0.5)
As you can see, the sensitivity at this threshold is 99.6%, and (1-specificity) is about 60%. This coordinate becomes a point in the ROC curve. To simplify this curve into a single number, we find the area under the curve (AUC).
Note that the area of the entire square is 1*1 = 1. Therefore, AUC itself is the ratio of the area under the curve to the total area. For the current case, we get an AUC ROC of 96.4%. Here are some rules of thumb:
  • .90-1 = Excellent (A)
  • .80-.90 = Good (B)
  • .70-.80 = Fair (C)
  • .60-.70 = Poor (D)
  • .50-.60 = Fail (F)

We find ourselves in the excellent range for the current model. However, this could simply be overfitting. In such cases, in-time and out-of-time validation becomes very important.

Key points to remember:
  • For models that output classes, they will be represented as a single point on the ROC graph.
  • Such models cannot be compared with each other because the judgment needs to be made against a single metric rather than using multiple metrics. For example, models with parameters (0.2,0.8) and (0.8,0.2) could come from the same model; hence, these metrics should not be directly compared.
  • In the case of probability models, we are fortunate to get a number, i.e., AUC-ROC. But we still need to examine the entire curve to make a definitive decision. It is also possible that one model performs better in certain areas, while another model performs better in others.
Advantages of using ROC:
  • Lift depends on the total response rate of the population. Therefore, if the response rate of the population changes, the same model will give a different lift chart. A solution to this could be a true lift chart (finding the ratio of each decile's lift to the lift of a perfect model). But such a ratio rarely makes sense for the business.

  • On the other hand, the ROC curve is almost independent of the response rate. This is because both of its axes are derived from column-wise calculations on the confusion matrix: if the response rate changes, the numerator and denominator of each axis change in the same proportion.
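To tie this section together, here is a small scikit-learn sketch on made-up probabilities: roc_curve returns the (1 − specificity, sensitivity) pairs across thresholds, and roc_auc_score gives the area under that curve:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical probability outputs from a classifier
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_prob = np.array([0.9, 0.4, 0.8, 0.6, 0.3, 0.55, 0.7, 0.2, 0.1, 0.65])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)   # fpr = 1 - specificity, tpr = sensitivity
auc = roc_auc_score(y_true, y_prob)

print("AUC-ROC:", auc)
print("Gini   :", 2 * auc - 1)                     # see section 2.7
```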

2.6 Log Loss
AUC ROC considers predicted probabilities to determine the performance of our model. However, there is a problem with AUC ROC; it only considers the order of probabilities and does not take into account the model’s ability to predict higher probabilities for samples that are more likely to be positive. In this case, we can use log loss, which is simply the negative average of the logarithm of the calibrated predicted probabilities for each instance.
Log Loss = −(1/N) × Σ [ yi × log(p(yi)) + (1 − yi) × log(1 − p(yi)) ]
  • p(yi) is the predicted probability of the positive class
  • 1-p(yi) is the predicted probability of the negative class
  • yi = 1 indicates positive class, 0 indicates negative class (actual value)
Let’s calculate the log loss for some random values to get the essence of the above mathematical function:
  • Log Loss (1, 0.1) = 2.303
  • Log Loss (1, 0.5) = 0.693
  • Log Loss (1, 0.9) = 0.105
If we plot this relationship, we will get the following curve:
(Figure: log loss vs. predicted probability for an actual positive instance)
It is evident from the gently declining slope to the right that as the predicted probability increases, log loss gradually decreases. However, moving in the opposite direction, when the predicted probability approaches 0, the log loss increases very rapidly.
Thus, the lower the log loss, the better the model. However, there is no absolute measure for what constitutes a good log loss, and it depends on the use case/application.
While AUC is calculated based on binary classifications with different decision thresholds, log loss actually considers the “certainty” of classifications.
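A short sketch computing log loss both directly from the formula above and with scikit-learn, on made-up values:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.8, 0.6, 0.1])   # predicted probability of the positive class

# Manual computation: clip probabilities so log(0) never occurs
eps = 1e-15
p = np.clip(y_prob, eps, 1 - eps)
manual = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print("manual :", manual)
print("sklearn:", log_loss(y_true, y_prob))
```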
2.7 Gini Coefficient
The Gini coefficient is sometimes used for classification problems. It can be derived directly from the AUC-ROC score: the Gini coefficient is simply the ratio of the area between the ROC curve and the diagonal line to the area of the triangle above the diagonal. Here is the formula used:
Gini = 2*AUC – 1

A Gini coefficient above 60% indicates a good model. For this case, we get a Gini coefficient of 92.7%.

2.8 Concordant/Discordant Ratio

This is again one of the most important evaluation metrics for any classification prediction problem. To understand it, let's assume we have 3 students, each with some probability of passing the exam this year. Here are our predicted probabilities:

  • A – 0.9
  • B – 0.5
  • C – 0.3

Now imagine, if we take pairs of two from these three students, how many pairs do we get? We get 3 pairs: AB, BC, and CA. At the end of the year, we see that A and C passed while B failed. Now, we select all pairs in which one student is a responder (passed) and the other a non-responder (failed). How many such pairs do we have?

We have two such pairs: AB and BC. For each of these pairs, a concordant pair is one where the predicted probability of the responder is higher than that of the non-responder; a discordant pair is the opposite. If the two probabilities are equal, we call it a tie. Let's see what happens in our case:

  • AB: Concordant

  • BC: Discordant

Thus, in this example, 50% of the pairs are concordant. A concordant ratio above 60% is considered a good model. This metric is not typically used when deciding how many customers to target, etc. It is primarily used to assess the predictive power of the model. Decisions like the number of targets are again made from KS/lift charts.
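Here is a small sketch that counts concordant and discordant pairs for the three-student example above (probabilities and outcomes as given in the text):

```python
from itertools import product

# Predicted pass probabilities and actual outcomes (1 = passed, 0 = failed)
probs  = {"A": 0.9, "B": 0.5, "C": 0.3}
actual = {"A": 1,   "B": 0,   "C": 1}

responders     = [s for s, y in actual.items() if y == 1]
non_responders = [s for s, y in actual.items() if y == 0]

concordant = discordant = ties = 0
for r, n in product(responders, non_responders):
    if probs[r] > probs[n]:
        concordant += 1
    elif probs[r] < probs[n]:
        discordant += 1
    else:
        ties += 1

total = concordant + discordant + ties
print(f"concordant ratio = {concordant / total:.0%}")   # 50% for this example
```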

2.9 Root Mean Square Error (RMSE)

RMSE is the most commonly used evaluation metric in regression problems. It follows the assumption that the errors are unbiased and follow a normal distribution. Here are the key points to consider for RMSE:

  • Taking the square root brings the metric back to the scale of the target variable while still reflecting large deviations.
  • Squaring the errors prevents positive and negative error values from cancelling each other out, so the measure reflects the true magnitude of the error term.
  • It avoids the use of absolute error values, which is undesirable in many mathematical calculations.
  • When we have more samples, reconstructing the error distribution using RMSE is considered more reliable.
  • RMSE is strongly affected by outliers. Therefore, make sure you have removed outliers from the dataset before using this metric.
  • Compared to mean absolute error, RMSE gives higher weight to large errors and penalizes them more heavily.

The RMSE metric is given by the following formula:

RMSE = √( Σ (predicted − actual)² / N )

Where N is the total number of observations.
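A minimal sketch with made-up regression values (not from any dataset in the article):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_actual = np.array([3.0, -0.5, 2.0, 7.0])
y_pred   = np.array([2.5,  0.0, 2.0, 8.0])

rmse = np.sqrt(mean_squared_error(y_actual, y_pred))   # square root of the mean squared error
print("RMSE:", rmse)
```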

2.10 Root Mean Square Log Error

For root mean squared logarithmic error, we take the logarithm of the predicted and actual values, so what changes is the variance that we are measuring. RMSLE is typically used when we do not want to penalize huge differences between predicted and actual values when both of them are huge numbers.

RMSLE = √( Σ (log(predicted + 1) − log(actual + 1))² / N )
  • If both predicted and actual values are small: RMSE and RMSLE are the same.
  • If either predicted or actual values are large: RMSE > RMSLE
  • If both predicted and actual values are large: RMSE > RMSLE (RMSLE becomes negligible)
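A small sketch of RMSLE using log1p (i.e., log(x + 1)), again on made-up non-negative values:

```python
import numpy as np

def rmsle(y_actual, y_pred):
    # log1p(x) = log(x + 1); inputs are assumed non-negative
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_actual)) ** 2))

y_actual = np.array([100.0, 1000.0, 10.0])
y_pred   = np.array([ 90.0, 1200.0, 12.0])
print("RMSLE:", rmsle(y_actual, y_pred))
```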

2.11 R Squared

We understand that as RMSE decreases, the performance of the model improves. But these values alone are not intuitive.

For classification problems, if the accuracy of the model is 0.8, we can measure how good our model is compared to a random model with an accuracy of 0.5. So the random model can serve as a baseline. But when we talk about the RMSE metric, we do not have a baseline to compare.

This is where we can use the R-squared measure. The formula for R-squared is as follows:

R² = 1 − MSE(model) / MSE(baseline)

  • MSE (model): Mean squared error of predictions vs actual values
  • MSE (baseline): Mean squared error of mean predictions vs actual values

In other words, how well does our regression model perform compared to a very simple model that predicts the average of the targets in the training set?
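A quick sketch showing that the formula above matches scikit-learn's r2_score, on made-up values:

```python
import numpy as np
from sklearn.metrics import r2_score

y_actual = np.array([3.0, -0.5, 2.0, 7.0])
y_pred   = np.array([2.5,  0.0, 2.0, 8.0])

# Baseline model: always predict the mean of the actual values
mse_model    = np.mean((y_actual - y_pred) ** 2)
mse_baseline = np.mean((y_actual - y_actual.mean()) ** 2)

print("manual R^2 :", 1 - mse_model / mse_baseline)
print("sklearn R^2:", r2_score(y_actual, y_pred))
```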

2.12 Adjusted R Squared

If the performance of the model equals the baseline, then R-squared is 0. The better the model, the higher the r2 value. The best model with all correct predictions has an R-squared value of 1. However, when adding new features to the model, the R-squared value can either increase or stay the same. R-squared is not penalized for adding features that have no value to the model. Therefore, an improved version of R-squared is the adjusted R-squared. The formula for adjusted R-squared is given by:

Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)
  • k: Number of features
  • n: Number of samples

As you can see, this metric takes the number of features into account. When we add more features, the term n − (k + 1) in the denominator decreases, which increases the whole fraction being subtracted from 1.

If R-squared does not increase, the added features are not adding value to our model. In that case, we end up subtracting a larger value from 1, and the adjusted R-squared decreases.
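A tiny helper with hypothetical values of R-squared, n, and k to see the penalty in action:

```python
def adjusted_r2(r2, n, k):
    """n = number of samples, k = number of features."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical: the same R^2 of 0.85 on 100 samples
print(adjusted_r2(0.85, n=100, k=5))    # few features  -> small penalty
print(adjusted_r2(0.85, n=100, k=50))   # many features -> much larger penalty
```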

In addition to these 12 evaluation metrics, there is another method for checking model performance. The metrics above are statistically well grounded, but with the advent of machine learning we now have more powerful methods of model selection at our disposal. Yes! I am talking about cross-validation.

Although cross-validation is not really an evaluation metric that is used openly to communicate model accuracy, its results provide a good enough intuition of how well the model generalizes.

Now let’s take a closer look at cross-validation.


3. Cross-Validation


Let’s first understand the importance of cross-validation. Due to a busy schedule these days, I haven’t had much time to participate in data science competitions. Long ago, I participated in the TFI competition on Kaggle. Without going into my competition performance in detail, I want to show you the difference between my public leaderboard score and private leaderboard score.

For the TFI competition, here are my three solutions and scores (the lower, the better)

(Table: public and private leaderboard scores for the three TFI submissions)

You will notice that the third entry with the worst public score is the best model in the private ranking. There are over 20 models above “submission_all.csv”, yet I still chose “submission_all.csv” as my final entry (it performed really well). What causes this phenomenon? The difference between my public and private leaderboard is due to overfitting.

Overfitting is nothing but your model becoming so complex that it starts capturing noise as well. This "noise" adds no value to the model, only inaccuracy.

In the next section, I will discuss how to know if a solution is overfitting before we actually know the test set results.

3.1 The Concept of Cross-Validation

Cross-validation is one of the most important concepts in any type of data modeling. It simply says: before finalizing the model, set aside a sample that is not used to train the model, and test the model on that sample.

(Figure: the population split into a training sample and a hold-out validation sample)

The above figure shows how to validate a model using an in-time sample. We simply divide the population into 2 samples and build the model on one of them. The remaining population is used for in-time validation.

Does this method have a downside?

I believe the downside of this method is that we lose a significant amount of data when training the model. Therefore, the bias of this model is very high. This does not provide the best estimate of coefficients. So what is the next best option?

What if we make a 50:50 split of the training population, train on the first 50%, and validate on the remaining 50%? Then we train on the other 50% and test on the first 50%. This way, we train the model on the entire population, though only on 50% at a time. This reduces the bias due to sample selection to some extent, but gives a smaller sample to train the model on. This approach is known as 2-fold cross-validation.

3.2 K-Fold Cross-Validation

Let's extrapolate from the last example of 2-fold cross-validation to k-fold. Now let's visualize how k-fold cross-validation works.

(Figure: 7-fold cross-validation — the population divided into 7 equal samples, each held out once for validation)

This is 7-fold cross-validation.

What happens behind the scenes is this: we divide the entire population into 7 equal samples. We train the model on 6 samples (green boxes) and validate it on 1 sample (grey box). Then, in the second iteration, we train the model with a different sample held out for validation. In 7 iterations, we have basically built a model on each sample, holding each one out as validation. This is a way to reduce selection bias and the variance in predictive power. Once we have all 7 models, we average the error terms to find which model is best.

How does this help find the best (non-overfitted) model?

K-fold cross-validation is widely used to check whether a model is overfitting. If the performance metrics across the k folds are close to each other and their mean is high, the model generalizes well. In a Kaggle competition, you might then rely more on the cross-validation score than on the public leaderboard score; this way, you can be sure that the public score is not just a coincidence.

But how do we choose k?

This is the tricky part. We need to weigh the trade-offs when choosing k.

  • For smaller k, we have higher selection bias but lower performance variance.

  • For larger k, we have less selection bias, but greater performance variance.

Think about the extreme cases:

  • k = 2: We have only 2 samples, similar to the 50:50 example. Here, we build a model on only 50% of the population each time. But since the validation set is a large share of the population, the variance of the validation performance is minimal.

  • k = number of observations (n): This is also known as “leave-one-out.” We have n samples, and modeling is repeated n times, leaving one observation for cross-validation. Thus, selection bias is small, but variance of validation performance is very high.

Generally, for most purposes, a value of k = 10 is recommended.
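As an illustration of k-fold cross-validation in practice, here is a scikit-learn sketch on a synthetic dataset; the model and the scoring choice are arbitrary assumptions, not the article's setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# 10-fold cross-validation; scoring could also be "neg_log_loss", "f1", etc.
scores = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
print("per-fold AUC:", np.round(scores, 3))
print("mean AUC    :", scores.mean(), "+/-", scores.std())
```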


4. Conclusion


Evaluating a model is just as important as building it. The right metric depends on the type of model and the business problem: confusion-matrix-based measures, gain and lift charts, K-S, AUC-ROC, log loss, the Gini coefficient, and the concordant/discordant ratio for classification; RMSE, RMSLE, R-squared, and adjusted R-squared for regression. Combined with cross-validation, these metrics help you select a model that performs well on out-of-sample data rather than one that merely fits the training set.