In-Depth Analysis of Model Evaluation Methods


After training a model, only by objectively evaluating its performance can we make sound practical decisions. Model evaluation mainly covers: prediction error, degree of fitting, model stability, and so on. In some scenarios there are also requirements on prediction speed (throughput), computational resource consumption, and interpretability, which are not elaborated here.

1. Evaluating Prediction Error Situations

The prediction error of a machine learning model is usually the focus of evaluation. What matters is not only good learning and prediction performance on the training data, but fundamentally good predictive performance on new data (generalization ability). Therefore, we usually evaluate a model's generalization performance through its metric values on the test set.

Commonly used loss functions can serve as indicators of the model's prediction error, such as mean squared error for regression. However, loss functions as evaluation metrics have limitations and are not always intuitive (for example, classification tasks often use F1-score, which directly shows how well each class is classified). Below we go through the commonly used error evaluation metrics for regression and classification tasks separately.

1.1 Error Evaluation Metrics for Regression Tasks

To evaluate the error of a regression model, a simple approach is to aggregate the differences between the true values and the predicted values. The commonly used metrics are as follows:

  • Mean Squared Error (MSE)
    Mean squared error (MSE) is the average of the squared differences between the actual values and the predicted values, where y_i is the actual value and ŷ_i is the predicted value:

    MSE = (1/n) * Σ (y_i − ŷ_i)^2
  • Root Mean Squared Error (RMSE)

Root mean squared error (RMSE) is the square root of MSE.

RMSE = sqrt(MSE) = sqrt( (1/n) * Σ (y_i − ŷ_i)^2 )
  • Mean Absolute Error (MAE)

Mean absolute error (MAE) is the average of the absolute differences between the predicted values and the true values.

MAE = (1/n) * Σ |y_i − ŷ_i|

Since MAE uses absolute values (which are not differentiable at zero), it is rarely used as a training loss function. However, it is still applicable for final model evaluation.

  • Root Mean Squared Logarithmic Error (RMSLE)

Root mean squared logarithmic error applies a log(1 + x) transform to the values before computing RMSE, so it measures relative (ratio) error rather than absolute error:

RMSLE = sqrt( (1/n) * Σ (ln(ŷ_i + 1) − ln(y_i + 1))^2 )
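As a quick reference, these metrics can be computed with numpy or called directly from sklearn.metrics. The snippet below is a minimal sketch on made-up values (kept non-negative only so that RMSLE is defined):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_squared_log_error

# Hypothetical regression targets and predictions (non-negative so RMSLE is defined)
y_true = np.array([3.0, 5.0, 2.5, 7.0, 10.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0, 9.0])

mse = mean_squared_error(y_true, y_pred)                  # average squared error
rmse = np.sqrt(mse)                                       # back to the target's original units
mae = mean_absolute_error(y_true, y_pred)                 # average absolute error
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))   # RMSE on the log(1 + x) scale

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  RMSLE={rmsle:.3f}")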

Comparison of the above indicators:

① Sensitivity to outliers: MAE reflects the average magnitude of the actual prediction error, while RMSE and MSE square the differences, which amplifies the influence of samples with large errors (making them more sensitive to outliers). Even a few extreme outliers can make these two metrics very poor. To reduce the impact of outliers, RMSLE can be used: it focuses on the ratio between predicted and true values, thereby reducing the influence of outliers.

② Units of the metric: unlike MSE, whose units are the square of the target's units, RMSE (which squares the differences and then takes the square root) and MAE keep the original units of the target, making them more intuitive. Although RMSE and MAE share the same units, RMSE is always at least as large as MAE, because squaring before averaging and taking the square root gives larger individual errors more weight.

③ Cross-task comparability: in practical applications, the scale of RMSE and MAE depends on the task. For example, if the prediction error for stock prices is 10 yuan and the prediction error for house prices is 10,000 yuan, we cannot tell which model performs better across the two tasks. The R^2 score introduced next normalizes the error and provides a unified evaluation standard.

  • R^2 Score

The R^2 score is commonly used to evaluate the fitting effect of linear regression, defined as follows:

R^2 = 1 − Σ (y_i − ŷ_i)^2 / Σ (y_i − ȳ)^2

where ȳ is the mean of the actual values. In other words, R^2 equals one minus the ratio of our model's squared error to the squared error of a baseline model that always predicts the mean of the actual values. The R^2 score therefore normally falls in the range [0, 1] (it can even be negative when the model is worse than the baseline): a value of 0 means our model is no better than the baseline, while a value of 1 means the model fits perfectly, with no error.

To add, when R^2 is 0 for a linear regression model, it also indirectly indicates that there is no linear relationship between the features and the label. This is the principle behind the commonly used multicollinearity indicator VIF: each feature is treated in turn as the label, a linear model is fitted on the remaining features, and the resulting R^2 is converted into

VIF = 1 / (1 − R^2)

A VIF of 1 indicates that there is no multicollinearity between the features (multicollinearity affects the stability and interpretability of linear models; a threshold of VIF < 10 is commonly used in engineering).
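To make the VIF procedure concrete, here is a minimal sketch (the data and column names are hypothetical) that fits each feature against the others with sklearn's LinearRegression and converts the resulting R^2 into a VIF; the variance_inflation_factor helper in statsmodels implements the same idea:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical feature matrix; x3 is nearly a linear combination of x1 and x2
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
X["x3"] = 0.7 * X["x1"] + 0.3 * X["x2"] + rng.normal(scale=0.05, size=200)

def vif(df):
    out = {}
    for col in df.columns:
        # Treat the current feature as the "label" and fit it with all other features
        others = df.drop(columns=[col])
        r2 = LinearRegression().fit(others, df[col]).score(others, df[col])
        out[col] = 1.0 / (1.0 - r2)   # VIF = 1 / (1 - R^2)
    return pd.Series(out)

print(vif(X))   # x1 and x3 should show large VIFs, signalling multicollinearity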

1.2 Error Evaluation Metrics for Classification Models

For classification model errors, loss functions (such as cross-entropy) can be used for evaluation. In classification models, cross-entropy is more appropriate than MSE; simply put, MSE indiscriminately focuses on the differences between predicted probabilities and true probabilities across all categories, while cross-entropy focuses on the predicted probability of the correct category.

Cross-entropy loss: L = −(1/N) * Σ_n Σ_k y_nk * ln(ŷ_nk), where y_nk is 1 if sample n belongs to class k (0 otherwise) and ŷ_nk is the predicted probability of class k for sample n.
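As a rough illustration of this point (made-up probabilities, not an example from the original article): two predicted distributions that give the same probability to the correct class have identical cross-entropy, but different squared error against the one-hot target, because squared error also cares how the remaining mass is spread over the wrong classes.

import numpy as np
from sklearn.metrics import log_loss

# One sample whose true class is 0, scored by two hypothetical models
y_true = [0]
p_a = np.array([[0.5, 0.25, 0.25]])
p_b = np.array([[0.5, 0.5, 0.0]])
labels = [0, 1, 2]

# Cross-entropy only looks at the probability of the true class: both equal -ln(0.5)
print(log_loss(y_true, p_a, labels=labels), log_loss(y_true, p_b, labels=labels))

# Squared error against the one-hot target differs between the two predictions
onehot = np.array([[1.0, 0.0, 0.0]])
print(((p_a - onehot) ** 2).mean(), ((p_b - onehot) ** 2).mean())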

However, using loss functions to evaluate classification performance is not very intuitive, so evaluation of classification tasks often also uses F1-score, precision, and recall, which can directly show the correct classification situation of various categories.

  • Precision, Recall, F1-score, Accuracy
The four counts TP, FP, TN and FN come from the binary confusion matrix:

                      Predicted Positive      Predicted Negative
Actual Positive       TP (true positive)      FN (false negative)
Actual Negative       FP (false positive)     TN (true negative)

Accuracy: the proportion of all correct predictions (TP + TN) out of the total number of samples (TP + FP + TN + FN);

Precision P: the number of samples correctly predicted as positive (TP) divided by the total number of samples predicted as positive (TP + FP);

Recall R: the number of samples correctly predicted as positive (TP) divided by the total number of actual positive samples (TP + FN).

F1-score is the harmonic mean of precision P and recall R:

F1 = 2 * P * R / (P + R)

Summary of the above indicators:

① Overall accuracy across all categories: accuracy gives a direct description of the classification error. However, when positive and negative examples are imbalanced, accuracy has little reference value. For example, in a fraud detection scenario there may be 950 normal users (negative examples) and 50 abnormal users (positive examples). If the model predicts every sample as a normal user, accuracy reaches an impressive 95%, yet the classifier is actually useless. Because accuracy cannot reflect how the minority class is misclassified, F1-score is more commonly used, as it considers both precision and recall.

② Balancing precision and recall: precision and recall are often conflicting indicators. Sometimes it is necessary to choose to be “more precise” or “more comprehensive” based on business considerations (for example, fraud detection is often biased towards identifying more positive examples, even at the cost of more false positives: “better to mistakenly kill a hundred than to let one go”). In this case, the precision-recall curve (P-R curve) obtained by sweeping the classification threshold can be used to balance the two; a short sketch follows this summary.
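Both points above can be reproduced with a small sketch (synthetic data; the 950/50 split mirrors the fraud example, everything else is hypothetical): a trivial "predict everything as normal" model reaches 95% accuracy but an F1 of 0, and precision_recall_curve shows how precision and recall trade off as the threshold moves.

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_recall_curve

rng = np.random.default_rng(0)

# 950 normal users (label 0) and 50 abnormal users (label 1), as in the fraud example
y = np.array([0] * 950 + [1] * 50)

# Trivial model: predict every sample as normal
y_trivial = np.zeros_like(y)
print(accuracy_score(y, y_trivial))                 # 0.95 -- looks great
print(f1_score(y, y_trivial, zero_division=0))      # 0.0  -- useless on the minority class

# A hypothetical scoring model: positive samples tend to receive higher scores
scores = np.where(y == 1, rng.normal(0.7, 0.15, y.size), rng.normal(0.3, 0.15, y.size))
precision, recall, thresholds = precision_recall_curve(y, scores)
for p, r, t in list(zip(precision, recall, thresholds))[::200]:
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")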

  • Kappa Value
    Kappa is an indicator used for consistency testing (for classification problems, consistency refers to whether the model’s prediction results are consistent with the actual classification results). The kappa value is also calculated based on the confusion matrix, and it is an indicator that can penalize the model’s prediction bias. According to the kappa calculation formula, the more unbalanced the confusion matrix (i.e., the greater the differences in prediction accuracy among different categories), the lower the kappa value.
κ = (p_o − p_e) / (1 − p_e)

where p_o is the observed overall accuracy and p_e is the accuracy expected from random agreement, computed from the marginal totals of the confusion matrix.

The formula can be interpreted as the ratio between the model's improvement over random accuracy (p_o − p_e) and the maximum possible improvement over random accuracy (1 − p_e).

Kappa values range from -1 to 1 and are usually greater than 0. They can be divided into five groups to represent different levels of consistency: 0.0~0.20 indicates very low consistency (slight), 0.21~0.40 indicates fair consistency, 0.41~0.60 indicates moderate consistency, 0.61~0.80 indicates substantial consistency, and 0.81~1 indicates almost perfect consistency.
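A minimal sketch of kappa on made-up labels, using sklearn's cohen_kappa_score, with a manual computation from the confusion matrix alongside it for comparison:

import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Hypothetical true labels and predictions for a 3-class problem
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 1]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0, 2, 1]
print(cohen_kappa_score(y_true, y_pred))

# Manual computation: kappa = (p_o - p_e) / (1 - p_e)
cm = confusion_matrix(y_true, y_pred)
n = cm.sum()
p_o = np.trace(cm) / n                                   # observed overall accuracy
p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2   # chance agreement from the marginals
print((p_o - p_e) / (1 - p_e))                           # matches cohen_kappa_score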

  • ROC Curve, AUC
    The ROC curve (Receiver Operating Characteristic curve) is in fact a summary of multiple confusion matrices. If, instead of fixing a classification threshold, we sort the model's predicted scores from high to low and use each value in turn as a dynamic threshold, we obtain a confusion matrix for every threshold.

For each confusion matrix, we calculate two indicators: the true positive rate TPR = TP / (TP + FN) = recall, and the false positive rate FPR = FP / (FP + TN). The ROC curve is then plotted with FPR on the x-axis and TPR on the y-axis. By calculating the area under the ROC curve we obtain the AUC (Area Under Curve), which gives an intuitive overall evaluation of the classifier; it usually lies between 0.5 and 1, with larger values being better.
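A minimal sketch of computing the ROC curve and AUC from predicted scores (hypothetical labels and scores), using the same sklearn functions imported later in this article:

import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Hypothetical binary labels and predicted scores
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # one (FPR, TPR) point per dynamic threshold
print(auc(fpr, tpr))                                # area under the ROC curve
print(roc_auc_score(y_true, y_score))               # same value, computed directly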

Analysis summary of the AUC indicator:

  • Since the ROC measures “dynamic thresholds”, AUC does not depend on the classification threshold, freeing it from the limitations of observing classification effects based on fixed classification thresholds.

  • The ROC curve is plotted from the TPR and FPR values obtained at different thresholds. A larger area under the ROC curve (AUC) means a larger TPR at the same FPR, and a smaller FPR corresponds to a larger 1 − FPR = TN / (TN + FP) = TNR. Thus, AUC actually takes both TPR (also called recall or sensitivity) and TNR (also called specificity) into account.

  • From the confusion matrix it can be seen that TPR and TNR (= 1 − FPR) are each computed within a single actual class, so they do not depend on the actual ratio of positive to negative samples; they only measure how completely each actual class is recognized (unlike precision, which mixes samples from both actual classes). In simple terms: AUC is insensitive to the positive-to-negative ratio; even if that ratio changes significantly, the ROC curve and its area change little.

  • AUC is the area under the ROC curve, and its numerical meaning is the probability that, for a randomly drawn positive and negative sample, the positive sample receives a higher predicted score than the negative one. In other words, AUC measures the model's “ranking ability” (how often positive samples score above negative samples); it is not sensitive to the specific predicted probabilities and says nothing about how well those probabilities fit the data. For an excellent model, we expect the predicted probabilities of positive and negative samples to be well separated. For example, if a model predicts all negative samples as 0.49 and all positive samples as 0.51, its AUC is 1, yet the probabilities are so close that any disturbance could flip the predictions. We would prefer predictions with a large gap, such as negative samples below 0.1 and positive samples above 0.8; the AUC is the same, but such a model fits better and is more robust. The sketch below illustrates this ranking interpretation.
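Concretely, with hypothetical scores: the fraction of (positive, negative) pairs in which the positive sample receives the higher score equals roc_auc_score, and shrinking the score gap (0.51 vs 0.49) leaves the AUC unchanged.

import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_auc(y_true, y_score):
    # Probability that a random positive scores higher than a random negative (ties count as 0.5)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y = np.array([0, 0, 0, 1, 1, 1])
wide = np.array([0.05, 0.08, 0.10, 0.80, 0.85, 0.90])     # well-separated probabilities
narrow = np.array([0.49, 0.49, 0.49, 0.51, 0.51, 0.51])   # barely separated probabilities

for s in (wide, narrow):
    print(roc_auc_score(y, s), pairwise_auc(y, s))        # AUC is 1.0 in both cases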


Differences between AUC and F1-score

  • AUC does not depend on classification thresholds, while F1-score requires a specified threshold, leading to different results at different thresholds;
  • When the ratio of positive and negative samples changes, AUC is not greatly affected, while F1-score can be significantly impacted (because precision evaluates across actual categories);
  • Both include recall (the comprehensiveness of recognizing positive samples) and consider FP (the misclassification of negative samples as positive), requiring a balance of “completeness” and “precision”.
  • F1-score can flexibly adjust the emphasis on recall and precision through thresholds, while AUC only provides a general overview.
# The above indicators can be called directly from sklearn.metrics
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, roc_curve, auc, cohen_kappa_score, mean_squared_error
...
yhat = model.predict(x)   # hard class predictions of the trained model

accuracy_score(y, yhat)
precision_score(y, yhat)
recall_score(y, yhat)
f1_score(y, yhat)
cohen_kappa_score(y, yhat)

2. Model Fitting Degree

For the fitting degree of the model, terms like underfitting, well-fitted, and overfitting are commonly used. Generally, well-fitted models have better generalization abilities and perform better on unknown data (test sets).

We can assess the fitting degree of the model through the errors on the training and validation sets (for example, the loss function values). Viewed over the whole training process: when both the training error and the validation error are high, the model is underfitting, and both decrease as training time and model complexity increase; after passing the optimal fitting point, the training error keeps decreasing while the validation error starts to rise, and the model enters the overfitting region. In practice, underfitting is usually easier to handle: stronger features and more complex models can raise learning accuracy. Addressing overfitting, that is, reducing generalization error and enhancing generalization ability, is usually the key to optimizing model performance. Common remedies include improving data quality and quantity and applying appropriate regularization strategies. For specifics, see the series article: A Deep Dive into Solving Model Overfitting.
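A minimal sketch of this diagnosis on synthetic data, using tree depth as the complexity knob (all names and numbers are illustrative): as complexity grows, the training error keeps falling, while the validation error first falls and then rises once the model starts to overfit.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic noisy regression problem
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, 3, 6, 12, None):   # increasing model complexity
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    val_err = mean_squared_error(y_val, model.predict(X_val))
    # Underfitting: both errors high; overfitting: train error keeps falling while validation error rises
    print(f"max_depth={depth}: train MSE={train_err:.3f}, validation MSE={val_err:.3f}")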

3. Model Stability

If the deployed model is unstable, it means the model is uncontrollable, affecting the rationality of decisions. For business, this represents a form of uncertainty risk, which is unacceptable (especially in risk-averse fields like risk control).

We usually use the Population Stability Index (PSI) to measure whether the distribution of model scores on future (test) samples remains consistent with the distribution on the training samples, and thereby evaluate model stability. Similarly, PSI can be applied to feature values to assess stability at the data feature level.

The PSI calculation takes the model scores on the training samples as the stability reference (the expected score proportions) and measures how far the actual predicted score proportions deviate from them. The calculation formula is

PSI = SUM over score segments of (actual proportion − expected proportion) * ln(actual proportion / expected proportion)

The specific calculation steps and example code are as follows:

Step 1: Discretize the expected value distribution (development dataset) into bins and count the sample proportions in each bin.
Step 2: Similarly, for the actual distribution (test set), count the sample proportions within each bin.
Step 3: Within each bin, calculate A − E and ln(A / E) (A = actual proportion, E = expected proportion), and compute index = (A − E) * ln(A / E).
Step 4: Sum the indices of each bin to obtain the final PSI.
import math
import numpy as np
import pandas as pd

def calculate_psi(base_list, test_list, bins=20, min_sample=10):
    try:
        base_df = pd.DataFrame(base_list, columns=['score'])
        test_df = pd.DataFrame(test_list, columns=['score']) 
        
        # 1. Remove missing values and count the number of samples in both distributions
        base_notnull_cnt = len(list(base_df['score'].dropna()))
        test_notnull_cnt = len(list(test_df['score'].dropna()))

        # Count missing values in each distribution
        base_null_cnt = len(base_df) - base_notnull_cnt
        test_null_cnt = len(test_df) - test_notnull_cnt
        
        # 2. Build bin edges from quantiles of the base distribution (cap the bin count so each bin has at least min_sample samples)
        q_list = []
        if type(bins) == int:
            bin_num = min(bins, int(base_notnull_cnt / min_sample))
            q_list = [x / bin_num for x in range(1, bin_num)]
            break_list = []
            for q in q_list:
                bk = base_df['score'].quantile(q)
                break_list.append(bk)
            break_list = sorted(list(set(break_list))) # Remove duplicates and sort
            score_bin_list = [-np.inf] + break_list + [np.inf]
        else:
            score_bin_list = bins
        
        # 3. Count the number of samples falling into each bin
        base_cnt_list = [base_null_cnt]
        test_cnt_list = [test_null_cnt]
        bucket_list = ["MISSING"]
        for i in range(len(score_bin_list)-1):
            left  = round(score_bin_list[i+0], 4)
            right = round(score_bin_list[i+1], 4)
            bucket_list.append("(" + str(left) + ',' + str(right) + ']')
            
            base_cnt = base_df[(base_df.score > left) & (base_df.score <= right)].shape[0]
            base_cnt_list.append(base_cnt)
            
            test_cnt = test_df[(test_df.score > left) & (test_df.score <= right)].shape[0]
            test_cnt_list.append(test_cnt)
        
        # 4. Summarize counts and distribution proportions per bin
        stat_df = pd.DataFrame({"bucket": bucket_list, "base_cnt": base_cnt_list, "test_cnt": test_cnt_list})
        stat_df['base_dist'] = stat_df['base_cnt'] / len(base_df)
        stat_df['test_dist'] = stat_df['test_cnt'] / len(test_df)
        
        def sub_psi(row):
            # 5. Compute the PSI contribution of a single bin
            base_dist = row['base_dist']
            test_dist = row['test_dist']
            # Handle bins where one of the proportions is 0 (the log would otherwise be undefined)
            if base_dist == 0 and test_dist == 0:
                return 0
            elif base_dist == 0 and test_dist > 0:
                base_dist = 1 / base_notnull_cnt
            elif base_dist > 0 and test_dist == 0:
                test_dist = 1 / test_notnull_cnt

            return (test_dist - base_dist) * np.log(test_dist / base_dist)
        
        stat_df['psi'] = stat_df.apply(lambda row: sub_psi(row), axis=1)
        stat_df = stat_df[['bucket', 'base_cnt', 'base_dist', 'test_cnt', 'test_dist', 'psi']]
        psi = stat_df['psi'].sum()
        
    except Exception as e:
        print('calculate_psi error:', e)
        psi = np.nan
        stat_df = None
    return psi, stat_df

## Alternatively, the toad package can be called directly to calculate PSI
# prob_dev: model scores on the training samples; prob_test: model scores on the test samples
import toad
psi = toad.metrics.PSI(prob_dev, prob_test)

Analyzing the principle of the PSI indicator, we can see that PSI is equivalent to the KL divergence from the actual distribution (A) to the expected distribution (E) plus the KL divergence from E to A. Since KL divergence itself is asymmetric, summing the two directions yields a symmetric and therefore more comprehensive description of the difference between the two distributions.

PSI = Σ (A_i − E_i) * ln(A_i / E_i) = Σ A_i * ln(A_i / E_i) + Σ E_i * ln(E_i / A_i) = KL(A || E) + KL(E || A)
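This identity is easy to verify numerically. The sketch below (made-up binned proportions) compares the summed PSI terms with scipy's entropy function, which computes KL divergence when given two distributions:

import numpy as np
from scipy.stats import entropy

# Hypothetical binned score proportions (expected = training, actual = test), each summing to 1
expected = np.array([0.10, 0.20, 0.30, 0.25, 0.15])
actual = np.array([0.12, 0.18, 0.28, 0.22, 0.20])

psi = np.sum((actual - expected) * np.log(actual / expected))
kl_sum = entropy(actual, expected) + entropy(expected, actual)   # KL(A||E) + KL(E||A)
print(psi, kl_sum)   # the two values coincide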

The smaller the PSI value (a threshold of PSI < 0.1 is commonly used), the smaller the difference between the two distributions and the more stable the model. The advantage of PSI in practical applications lies in its computational convenience. However, note that the PSI value is influenced by several factors, including the number of bins and the binning method, the sample size, and actual business policies. In particular, for small samples with significant business fluctuations, PSI often exceeds the usual empirical thresholds, so it must be interpreted in light of the actual business and data conditions.
