After a model has been trained, only by objectively evaluating its performance can we make sound practical decisions. Model evaluation mainly covers: prediction error, degree of fitting, model stability, and so on. Some scenarios also impose requirements on prediction speed (throughput), computational resource consumption, and interpretability, which are not elaborated here.
1. Evaluating Prediction Error Situations
The prediction error of a machine learning model is usually the focus of evaluation. What matters is not only good learning and prediction on the training data, but fundamentally good predictive capability on new data (generalization ability). Therefore, we usually evaluate a model's generalization performance through its metric scores on the test set.
Commonly used loss functions can serve as indicators of the model's prediction error, such as the mean squared error for regression. However, using loss functions as evaluation metrics has limitations and is not very intuitive (for example, classification tasks often use the F1-score instead, which directly shows how well each category is classified). Below we interpret the commonly used error evaluation metrics for regression and classification tasks separately.
1.1 Error Evaluation Metrics for Regression Tasks
To evaluate the error of a regression model, a simple idea is to average the differences between the true values and the predicted values. The commonly used metrics are as follows:
- Mean Squared Error (MSE)
Mean squared error (MSE) is the average of the squared differences between the actual values and the predicted values, where y is the actual value and ŷ is the predicted value:
MSE = (1/n) * Σ_i (y_i - ŷ_i)^2
- Root Mean Squared Error (RMSE)
Root mean squared error (RMSE) is the square root of MSE:
RMSE = sqrt(MSE) = sqrt((1/n) * Σ_i (y_i - ŷ_i)^2)
- Mean Absolute Error (MAE)
Mean absolute error (MAE) is the average of the absolute differences between the predicted values and the true values:
MAE = (1/n) * Σ_i |y_i - ŷ_i|
Since MAE uses absolute values (which are not differentiable at zero), it is rarely used as a training loss function. However, it is still applicable for final model evaluation.
- Root Mean Squared Logarithmic Error (RMSLE)
RMSLE = sqrt((1/n) * Σ_i (ln(ŷ_i + 1) - ln(y_i + 1))^2)
Comparison of the above indicators:
① Sensitivity to outliers: MAE reflects the actual average prediction error, while RMSE and MSE square the differences, amplifying the influence of samples with larger errors (so they are more sensitive to outliers). If there are a few extreme outliers, even a small number of them can make these two metrics look very poor. To reduce the impact of outliers, RMSLE can be used: it focuses on the relative (ratio) error between predicted and true values, thereby reducing the influence of outliers.
② Dimensionality: Unlike MSE, which squares the differences, RMSE (the square root of the average squared error) and MAE keep the original unit of the target, making them more intuitive. Although RMSE and MAE share the same unit, RMSE is always at least as large as MAE, because squaring the errors before averaging and taking the square root amplifies the larger errors.
③ Cross-task dimensionality difference problem: In practical applications, RMSE and MAE have a problem where the dimensionality varies across different tasks. For example, if we predict stock price errors to be 10 yuan and house price errors to be 10,000 yuan, we cannot evaluate which model performs better across different tasks. Next, we introduce the R^2 score indicator, which normalizes the above errors and provides a unified evaluation standard.
- R^2 Score
The R^2 score is commonly used to evaluate the fitting effect of linear regression, defined as follows:
R^2 = 1 - Σ_i (y_i - ŷ_i)^2 / Σ_i (y_i - ȳ)^2, where ȳ is the mean of the actual values.
The R^2 score compares our model's mean squared error with the mean squared error obtained by always predicting the average of the actual values (a baseline model); R^2 is one minus that ratio.
Thus, the R^2 score is normalized: a value of 0 means our model brings no improvement over the baseline model, and a value of 1 means the model is perfect, with no error (a model that performs worse than the baseline can even yield a negative R^2).
To add, when the R^2 value is 0 and the model is a linear regression, it can also indirectly indicate that there is no linear relationship between the features and the label.
This is also the principle behind VIF, a commonly used multicollinearity indicator: each feature is taken in turn as the label and fitted by a linear model on the remaining features, and the resulting R^2 is converted into VIF = 1 / (1 - R^2). A VIF of 1 indicates no multicollinearity among the features (multicollinearity affects the stability and interpretability of linear models; a threshold of VIF < 10 is commonly used in engineering).
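A minimal sketch of the regression metrics above and of the VIF idea (assuming scikit-learn; the data and variable names are made up for illustration):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             mean_squared_log_error, r2_score)

# Synthetic regression data (illustrative only); targets kept positive so RMSLE is defined
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = 1 + 3 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.rand(200)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)                                  # RMSE = square root of MSE
mae = mean_absolute_error(y, y_pred)
rmsle = np.sqrt(mean_squared_log_error(y, y_pred))   # requires non-negative values
r2 = r2_score(y, y_pred)
print(mse, rmse, mae, rmsle, r2)

# VIF of each feature: fit it with the remaining features, then VIF = 1 / (1 - R^2)
for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    r2_j = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    print("VIF of feature", j, ":", round(1.0 / (1.0 - r2_j), 2))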
1.2 Error Evaluation Metrics for Classification Models
For classification errors, loss functions (such as cross-entropy) can be used for evaluation. In classification models, cross-entropy is more appropriate than MSE: simply put, MSE indiscriminately penalizes the gap between predicted and true probabilities on every category, while cross-entropy only looks at the predicted probability of the correct category:
CrossEntropy = -(1/n) * Σ_i Σ_k y_{i,k} * ln(ŷ_{i,k}), where y_{i,k} is 1 if sample i belongs to class k and 0 otherwise, and ŷ_{i,k} is the predicted probability of class k.
However, using loss functions to evaluate classification performance is not very intuitive, so classification tasks are often also evaluated with the F1-score, precision, and recall, which directly show how well each category is classified.
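Before turning to those metrics, here is a toy illustration of the MSE-versus-cross-entropy point above (using scikit-learn's log_loss for cross-entropy; the probabilities are made up):
import numpy as np
from sklearn.metrics import log_loss

# Made-up 3-class example: the true class of the single sample is class 0.
# Both predictions give the correct class the same probability (0.5),
# but spread the remaining mass over the wrong classes differently.
y_true = [0]
p1 = np.array([[0.5, 0.4, 0.1]])
p2 = np.array([[0.5, 0.25, 0.25]])
one_hot = np.array([[1.0, 0.0, 0.0]])

# Cross-entropy depends only on the probability of the correct class: identical values
print(log_loss(y_true, p1, labels=[0, 1, 2]), log_loss(y_true, p2, labels=[0, 1, 2]))
# MSE also reacts to how the wrong-class probabilities are distributed: different values
print(np.mean((p1 - one_hot) ** 2), np.mean((p2 - one_hot) ** 2))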
- Precision, Recall, F1-score, Accuracy
These metrics are all computed from the confusion matrix, whose entries are TP (true positives), FP (false positives), TN (true negatives), and FN (false negatives):
Accuracy: the proportion of all correct predictions (TP+TN) out of the total number of samples (TP+FP+TN+FN).
Precision P: the number of samples correctly predicted as positive (TP) divided by the total number of samples predicted as positive (TP+FP).
Recall R: the number of samples correctly predicted as positive (TP) divided by the total number of actually positive samples (TP+FN).
F1-score is the harmonic mean of precision P and recall R:
F1 = 2 * P * R / (P + R)
Summary of the above indicators:
① Overall accuracy across all categories: Accuracy gives a direct description of classification error. However, when positive and negative examples are imbalanced, accuracy has little reference value. For example, in a fraud detection scenario there may be 950 normal users (negative examples) and 50 abnormal users (positive examples); a model that predicts every sample as a normal user reaches an accuracy as high as 95%, yet its actual classification ability is very poor. Accuracy cannot reflect how the minority class is misclassified, so the F1-score is more commonly used, since it considers both precision and recall.
② Balancing precision and recall: Precision and recall are often contradictory indicators. Sometimes, it is necessary to choose to be “more precise” or “more comprehensive” based on business considerations (for example, in fraud detection scenarios, there is often a bias towards identifying more positive examples, even at the cost of higher false positives. “Better to mistakenly kill a hundred than to let one go”). In this case, the precision-recall curve (P-R curve) under different threshold divisions can be used to balance the two.
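A minimal sketch of these confusion-matrix metrics and the P-R curve under different thresholds (assuming scikit-learn; the imbalanced dataset below is synthetic and only for illustration):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, precision_recall_curve)

# Synthetic imbalanced data: roughly 95% negatives, 5% positives
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]   # predicted probability of the positive class
pred = (proba >= 0.5).astype(int)    # default threshold of 0.5

print(confusion_matrix(y, pred))     # rows are actual classes: [[TN, FP], [FN, TP]]
print(accuracy_score(y, pred), precision_score(y, pred),
      recall_score(y, pred), f1_score(y, pred))

# P-R curve: precision and recall at every candidate threshold
precision, recall, thresholds = precision_recall_curve(y, proba)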
- Kappa Value
Kappa is an indicator used for consistency testing (for classification problems, consistency refers to whether the model's predictions agree with the actual classes). The kappa value is also calculated from the confusion matrix, and it penalizes biased predictions: according to the kappa formula, the more unbalanced the confusion matrix (i.e., the larger the differences in prediction accuracy among categories), the lower the kappa value.
kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed overall accuracy and p_e is the accuracy expected from random predictions with the same marginal distributions.
The formula can be interpreted as the ratio between the improvement of the overall accuracy over random accuracy and the improvement of a perfect model over random accuracy.
Kappa values range from -1 to 1 and are usually greater than 0. They can be divided into five groups to represent different levels of consistency: 0.0~0.20 indicates very low consistency (slight), 0.21~0.40 indicates fair consistency, 0.41~0.60 indicates moderate consistency, 0.61~0.80 indicates substantial consistency, and 0.81~1 indicates almost perfect consistency.
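A small sketch of this penalty effect using scikit-learn's cohen_kappa_score (the labels are made up to mimic the imbalanced example above):
from sklearn.metrics import cohen_kappa_score, accuracy_score

# Imbalanced toy labels: 95 negatives, 5 positives (illustrative only)
y_true = [0] * 95 + [1] * 5

# A degenerate model that predicts the majority class everywhere:
# high accuracy, but kappa = 0 (no better than chance)
y_majority = [0] * 100
print(accuracy_score(y_true, y_majority), cohen_kappa_score(y_true, y_majority))

# A model that actually recovers part of the minority class: similar accuracy, much higher kappa
y_better = [0] * 93 + [1] * 2 + [1] * 3 + [0] * 2
print(accuracy_score(y_true, y_better), cohen_kappa_score(y_true, y_better))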
- ROC Curve, AUC
The ROC curve (Receiver Operating Characteristic curve) is in effect a summary of many confusion matrices. If we do not fix a single threshold, but instead sort the model's predicted scores from high to low and use each score as a dynamic threshold, we obtain multiple confusion matrices.
For each confusion matrix we calculate two indicators: TPR (true positive rate) and FPR (false positive rate), where TPR = TP/(TP+FN) = Recall and FPR = FP/(FP+TN). The ROC curve is then plotted with FPR on the x-axis and TPR on the y-axis. The area under the ROC curve is the AUC (Area Under Curve), which gives an intuitive evaluation of the classifier's performance; it usually lies between 0.5 and 1, and larger is better.
Analysis summary of the AUC indicator:
- Since the ROC sweeps over dynamic thresholds, AUC does not depend on a classification threshold, freeing the evaluation from the limitation of observing classification performance at one fixed threshold.
- The ROC curve is drawn from the TPR and FPR at different thresholds. A larger area under the curve (AUC) means a larger TPR at any given small FPR, and a smaller FPR corresponds to a larger 1 - FPR = TN/(TN+FP) = TNR. So AUC in fact considers both TPR (also called recall or sensitivity) and TNR (also called specificity).
- From the confusion matrix it can be seen that TPR and TNR (i.e., 1 - FPR) are each computed within a single actual class, so they are unrelated to the actual positive-to-negative ratio of the samples; they only measure how completely each actual class is recognized (unlike precision, which mixes samples from different actual classes). In simple terms: AUC is insensitive to the positive/negative sample ratio; even if that ratio changes significantly, the area under the ROC curve does not change much.
- AUC is the area under the ROC curve, and its numerical meaning is: the probability that, for a randomly drawn positive sample and a randomly drawn negative sample, the positive sample receives the higher predicted score. In other words, AUC measures ranking ability (positive samples scoring above negative samples) and is not sensitive to the concrete probability values, ignoring the model's fitting quality. For an excellent model, we expect the predicted probabilities of positive and negative samples to be well separated. For example, if a model predicts every negative sample as 0.49 and every positive sample as 0.51, its AUC is 1, yet the probabilities are so close that any perturbation leads to mispredictions. We would prefer a larger gap, for example negative samples below 0.1 and positive samples above 0.8; the AUC is the same, but this model fits better and is more robust.
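A small sketch of this ranking interpretation (assuming scikit-learn; the scores below are made up). Both score vectors rank every positive above every negative, so both reach AUC = 1, even though only the second is well separated:
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve, auc

y_true = np.array([0, 0, 0, 0, 1, 1])

# Barely separated scores: negatives at 0.49, positives at 0.51
scores_tight = np.array([0.49, 0.49, 0.49, 0.49, 0.51, 0.51])
# Well separated scores: negatives below 0.1, positives above 0.8
scores_wide = np.array([0.02, 0.05, 0.08, 0.09, 0.85, 0.95])

print(roc_auc_score(y_true, scores_tight))   # 1.0
print(roc_auc_score(y_true, scores_wide))    # 1.0

# Equivalent computation via the ROC curve itself
fpr, tpr, thresholds = roc_curve(y_true, scores_tight)
print(auc(fpr, tpr))                         # 1.0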

Differences between AUC and F1-score
- AUC does not depend on classification thresholds, while F1-score requires a specified threshold, leading to different results at different thresholds;
- When the ratio of positive and negative samples changes, AUC is not greatly affected, while F1-score can be significantly impacted (because precision evaluates across actual categories);
- Both include recall (the comprehensiveness of recognizing positive samples) and consider FP (the misclassification of negative samples as positive), requiring a balance of "completeness" and "precision";
- F1-score can flexibly adjust the emphasis on recall and precision through the threshold, while AUC only provides a general overview.
# The above indicators can be directly called from sklearn.metrics
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score, roc_curve, auc, cohen_kappa_score, mean_squared_error
...
yhat = model.predict(x)
f1_score(y, yhat)
2. Model Fitting Degree
For the fitting degree of the model, terms like underfitting, well-fitted, and overfitting are commonly used. Generally, well-fitted models have better generalization abilities and perform better on unknown data (test sets).
We can assess the fitting degree of a model through the errors (e.g., loss values) on the training and validation sets. Over the course of training, underfitting corresponds to both the training error and the validation error being high; they decrease as training time and model complexity increase. After an optimal fitting point is passed, the training error keeps decreasing while the validation error starts to rise, and the model enters the overfitting region. In practice, underfitting is usually not hard to fix: stronger features and more complex models can improve learning accuracy. Addressing overfitting, i.e., reducing generalization error and enhancing generalization ability, is usually the key to optimizing model performance. Common remedies include improving data quality and quantity and employing appropriate regularization strategies. For specifics, see the series of articles: A Deep Dive into Solving Model Overfitting.
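A minimal sketch of this diagnosis using training/validation scores (assuming scikit-learn; the dataset, estimator, and parameter range are arbitrary illustrative choices):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Model complexity grows with tree depth; compare training vs validation scores
depths = np.arange(1, 16)
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)

for d, tr, va in zip(depths, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    # Both scores low -> underfitting; training high while validation drops -> overfitting
    print(f"max_depth={d:2d}  train={tr:.3f}  valid={va:.3f}")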
3. Model Stability
If a deployed model is unstable, its behavior is uncontrollable, which undermines the rationality of the decisions made with it. For the business, this is a form of uncertainty risk and is unacceptable (especially in risk-averse fields such as risk control).
We usually use the Population Stability Index (PSI) to measure whether the distribution ratios of scores between future (test set) samples and model training samples remain consistent, to evaluate model stability. Similarly, PSI can also be used to measure the distribution differences of feature values to assess stability at the data feature level.
The PSI calculation takes the distribution of model scores on the training samples as the stability reference (the expected score proportions) and measures how far the distribution of the actual predicted scores deviates from it. The formula is PSI = SUM over score segments of (actual proportion - expected proportion) * ln(actual proportion / expected proportion). The specific calculation steps and example code are as follows:
import numpy as np
import pandas as pd

def calculate_psi(base_list, test_list, bins=20, min_sample=10):
    try:
        base_df = pd.DataFrame(base_list, columns=['score'])
        test_df = pd.DataFrame(test_list, columns=['score'])
        # 1. Remove missing values and count the number of samples in both distributions
        base_notnull_cnt = len(list(base_df['score'].dropna()))
        test_notnull_cnt = len(list(test_df['score'].dropna()))
        # Missing values are counted into a separate "MISSING" bin
        base_null_cnt = len(base_df) - base_notnull_cnt
        test_null_cnt = len(test_df) - test_notnull_cnt
        # 2. Determine bin boundaries: quantile binning on the base distribution,
        #    keeping at least min_sample samples per bin
        q_list = []
        if type(bins) == int:
            bin_num = min(bins, int(base_notnull_cnt / min_sample))
            q_list = [x / bin_num for x in range(1, bin_num)]
            break_list = []
            for q in q_list:
                bk = base_df['score'].quantile(q)
                break_list.append(bk)
            break_list = sorted(list(set(break_list)))  # Remove duplicates and sort
            score_bin_list = [-np.inf] + break_list + [np.inf]
        else:
            score_bin_list = bins
        # 3. Count the number of samples in each bin
        base_cnt_list = [base_null_cnt]
        test_cnt_list = [test_null_cnt]
        bucket_list = ["MISSING"]
        for i in range(len(score_bin_list) - 1):
            left = round(score_bin_list[i + 0], 4)
            right = round(score_bin_list[i + 1], 4)
            bucket_list.append("(" + str(left) + ',' + str(right) + ']')
            base_cnt = base_df[(base_df.score > left) & (base_df.score <= right)].shape[0]
            base_cnt_list.append(base_cnt)
            test_cnt = test_df[(test_df.score > left) & (test_df.score <= right)].shape[0]
            test_cnt_list.append(test_cnt)
        # 4. Summarize counts and distribution proportions per bin
        stat_df = pd.DataFrame({"bucket": bucket_list, "base_cnt": base_cnt_list, "test_cnt": test_cnt_list})
        stat_df['base_dist'] = stat_df['base_cnt'] / len(base_df)
        stat_df['test_dist'] = stat_df['test_cnt'] / len(test_df)

        def sub_psi(row):
            # 5. PSI contribution of a single bin
            base_dist = row['base_dist']
            test_dist = row['test_dist']
            # Handle bins with zero samples
            if base_dist == 0 and test_dist == 0:
                return 0
            elif base_dist == 0 and test_dist > 0:
                base_dist = 1 / base_notnull_cnt
            elif base_dist > 0 and test_dist == 0:
                test_dist = 1 / test_notnull_cnt
            return (test_dist - base_dist) * np.log(test_dist / base_dist)

        stat_df['psi'] = stat_df.apply(lambda row: sub_psi(row), axis=1)
        stat_df = stat_df[['bucket', 'base_cnt', 'base_dist', 'test_cnt', 'test_dist', 'psi']]
        psi = stat_df['psi'].sum()
    except Exception:
        print('error!!!')
        psi = np.nan
        stat_df = None
    return psi, stat_df
## You can also directly call the toad package to calculate PSI
import toad
# prob_dev: model scores on training samples; prob_test: model scores on test samples
psi = toad.metrics.PSI(prob_dev, prob_test)
Analyzing the principle of the PSI indicator, we can see that PSI is equivalent to the KL divergence from the actual distribution (A) to the expected distribution (E) plus the KL divergence from E to A. KL divergence itself is asymmetric, so summing the two directions describes the difference in information between the distributions from both sides, giving a more comprehensive measure of the distribution difference:
PSI = Σ (A_i - E_i) * ln(A_i / E_i) = Σ A_i * ln(A_i / E_i) + Σ E_i * ln(E_i / A_i) = KL(A||E) + KL(E||A)
The smaller the PSI value (commonly < 0.1 as a standard), the smaller the difference between the two distributions, indicating greater stability. The advantage of PSI in practical applications lies in its computational convenience. However, it is important to note that the PSI calculation is influenced by multiple factors, including the number and method of grouping, population sample size, and actual business policies. Particularly for small samples with significant business fluctuations, PSI values often exceed general empirical levels, thus requiring specific analysis in conjunction with actual business and data conditions.
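To make the KL-divergence interpretation above concrete, here is a small numeric check (using scipy; the bin proportions are made up):
import numpy as np
from scipy.stats import entropy

# Made-up bin proportions for the expected (E) and actual (A) score distributions
E = np.array([0.10, 0.20, 0.30, 0.25, 0.15])
A = np.array([0.12, 0.18, 0.28, 0.22, 0.20])

psi = np.sum((A - E) * np.log(A / E))
kl_sum = entropy(A, E) + entropy(E, A)   # KL(A||E) + KL(E||A)
print(psi, kl_sum)                       # the two values coincide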