KNN Machine Learning Classification Result Testing and Analysis

This article is a featured post from the Kexue Forum. Author ID: 大大薇薇

1. Overview

python_mmdt: A Python Library That Generates Feature Vectors Based on Sensitive Hashing (Part 1) introduces the mmdthash (sensitive hashing) method and its basic concepts.
python_mmdt: From 0 to 1 – Implementing a Simple Malicious Code Classifier (Part 2) introduces a simple malicious code classifier built on mmdthash.
python_mmdt: From 1 to 2 – Implementing a KNN-Based Machine Learning Malicious Code Classifier (Part 3) introduces a machine learning malicious code classifier built on mmdthash.
python_mmdt: Online Use of mmdthash (Part 4) introduces how to use mmdthash for online malicious file detection.

In this article, we statistically test the classification results of the KNN-based classifier and evaluate the classification model.

2. Evaluation Conclusion

1. The KNN model contains 632,253 mmdthash records in total, covering 1,150 malicious family labels;
2. On the KNN model's multi-classification results, accuracy (ACC) exceeds 80% and recall (REC) exceeds 60%;
3. The similarity determination threshold can be set at 0.95: results with similarity of at least 0.95 are highly credible (precision (PRE) reaches 100%), while credibility gradually decreases below 0.95;
4. With the similarity threshold set to 0.95, the KNN model produced no false positives.

3. Project Information

GitHub code address: python_mmdt

4. KNN Model Result Analysis

Basic Information

1. KNN model information: 632,253 mmdthash records in total, covering 1,150 malicious family labels, of which binary files account for over 90%.
2. Test file set information: 200 randomly selected black (malicious) files not present in the KNN model plus 200 white (clean) files, 400 in total, over 95% of which are binaries. The 200 black files cover 58 malicious families; together with the clean-file class, this gives 59 family labels.
3. Test method: submit the 400 test files through the web interface (http://146.56.242.184/mmdt/scan) to obtain the most similar sample reported by the KNN model, apply the similarity threshold, compare the predicted label with the actual label, and compute accuracy (ACC), precision (PRE), and recall (REC) to evaluate the mmdthash-based KNN model. A minimal single-request sketch follows this list.
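The sketch below shows what one such scan request looks like; the request and response fields mirror those used in the full test script later in this article, and the similars list is assumed to contain at least the single most similar sample.

# -*- coding: utf-8 -*-
# Minimal sketch of a single scan request; field names mirror the test script below.
import requests

SCAN_URL = 'http://146.56.242.184/mmdt/scan'

def scan_one(file_sha1, file_mmdt):
    data = {
        "md5": file_sha1,
        "sha1": file_sha1,
        "file_name": file_sha1,
        "mmdt": file_mmdt,
        "data": {}
    }
    r = requests.post(url=SCAN_URL, json=data)
    r_data = r.json()
    if r_data.get('status', 0) == 20001:  # submission error reported by the service
        return None
    similars = r_data.get('data', {}).get('similars', [])
    if not similars:
        return None
    # Predicted family label, most similar sample's hash, and similarity score
    return (r_data.get('data', {}).get('label', 'unknown'),
            similars[0].get('hash', 'None'),
            similars[0].get('sim', 0.0))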

KNN Model Information

KNN model name: mmdt_feature_20220120.data
KNN model size: 23 MB
Number of mmdthash features: 632,253
Number of malicious families: 1,150
KNN model family Top 10 list:

[Figure: Top 10 families in the KNN model]

Test File Set Information

Test dataset name: mmdt_feature_test_400.data
Test dataset file size: 21 KB
Test dataset download address: download
Number of mmdthash features: 400
Number of family labels: 59
Test set family Top 10 list:

[Figure: Top 10 families in the test set]
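As a sketch of how such a Top 10 can be computed, the snippet below counts family labels in a feature file. It assumes each record parses the same way as in the test script later in this article (two mmdt fields, then the label, then the sha1):

# -*- coding: utf-8 -*-
# Sketch: family Top 10 of a feature set. Assumes each record has the layout
# parsed by the test script below: mmdt:mmdt:label:sha1.
from collections import Counter

from python_mmdt.mmdt.common import mmdt_load

def family_top10(data_file):
    features = mmdt_load(data_file)
    labels = Counter(f.strip().split(":")[2] for f in features)
    return labels.most_common(10)

print(family_top10('mmdt_feature_test_400.data'))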

Test Method

1. Use the python_mmdt library to compute mmdthash values for the 400 files, generating the test dataset mmdt_feature_test_400.data;
2. Use the python_mmdt library to submit each of the 400 test dataset entries to the web service and classify them with the KNN model;
3. Tally the KNN model's classification results along the dimensions TP, TN, FP, and FN, and from them compute accuracy (ACC), precision (PRE), and recall (REC):
  • TP is the number of samples whose actual label is true (malicious) and whose predicted label is true (malicious)

  • TN is the number of samples whose actual label is false (clean) and whose predicted label is false (clean)

  • FP is the number of samples whose actual label is false (clean) but whose predicted label is true (malicious)

  • FN is the number of samples whose actual label is true (malicious) but whose predicted label is false (clean)

  • Accuracy (ACC) is the proportion of all samples whose labels are predicted correctly (it can be understood as the detection rate of the KNN model), calculated as: ACC = (TP + TN)/(TP + TN + FP + FN)

  • Precision (PRE) is the proportion of samples predicted true (malicious) that are actually true (malicious) (it can be understood as the credibility of the KNN model's malicious verdicts), calculated as: PRE = TP/(TP + FP)

  • Recall (REC) is the proportion of actually true (malicious) samples that are predicted true (malicious) (it can be understood as the KNN model's detection coverage of malicious samples), calculated as: REC = TP/(TP + FN). A worked numeric example follows this list.
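As a worked example, the counts below are inferred from the 0.95-threshold results reported later in this article (200 black and 200 white files, FP = 0, REC = 63.5%); they are illustrative, not quoted from the test output:

# Counts inferred from the reported 0.95-threshold results (illustrative):
# 200 black + 200 white files, no false positives, 127 of 200 black detected.
TP, TN, FP, FN = 127, 200, 0, 73

ACC = (TP + TN) / (TP + TN + FP + FN)  # 327/400 = 0.818
PRE = TP / (TP + FP)                   # 127/127 = 1.000
REC = TP / (TP + FN)                   # 127/200 = 0.635

print('ACC=%.3f PRE=%.3f REC=%.3f' % (ACC, PRE, REC))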

Determination Basis
1. Accuracy (ACC) measures the overall effectiveness of the classifier: the higher the ACC, the better the classifier performs;
2. Precision (PRE) measures how often the classifier's true (malicious) verdicts are correct: the higher the PRE, the more credible the model's detections;
3. Recall (REC) measures the classifier's detection coverage of true (malicious) cases: the higher the REC, the stronger the model's detection capability.
Special Note
1. This statistical test compares two similarity thresholds, 0.95 and 0.90: a sample whose similarity exceeds the threshold is judged true (malicious), otherwise false (clean).
2. Both the KNN model and the test file set are multi-label, which in practical use amounts to multi-classification. Therefore, in this test, a sample judged malicious is counted as a TP only if its predicted label also matches its actual label. The resulting accuracy, precision, and recall are consequently lower than they would be under binary (black-or-white) counting; the sketch below makes the difference concrete.
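The two counting rules can be contrasted with a pair of hypothetical helpers (not part of python_mmdt):

# Hypothetical helpers contrasting the two counting rules.
# Binary counting: any malicious file flagged above the threshold is a TP.
def is_tp_binary(tag, label, sim, dlt=0.95):
    return tag != 'clean' and sim > dlt

# Multi-class counting (used in this test): the predicted family label
# must also match the actual label.
def is_tp_multiclass(tag, label, sim, dlt=0.95):
    return tag == label and sim > dlt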

Test Code

# -*- coding: utf-8 -*-
import sys

import requests

from python_mmdt.mmdt.common import mmdt_load

# Similarity determination threshold; two thresholds are tested: 0.95 and 0.90
dlt = 0.95

def mmdt_scan_online_check():
    file_name = sys.argv[1]
    # Load test data
    features = mmdt_load(file_name)
    # The 4 indicators
    TP = 0
    TN = 0
    FP = 0
    FN = 0
    count = 0
    print('Detection result, file md5, actual label, similar file, predicted label, similarity')
    for feature in features:
        count += 1
        tmp = feature.strip().split(":")
        file_mmdt = ':'.join(tmp[:2])
        tag = tmp[2]
        file_sha1 = tmp[3]
        data = {
            "md5": file_sha1,
            "sha1": file_sha1,
            "file_name": file_sha1,
            "mmdt": file_mmdt,
            "data": {}
        }
        r = requests.post(url='http://146.56.242.184/mmdt/scan', json=data)
        r_data = r.json()
        if r_data.get('status', 0) == 20001:
            status = r_data.get('status', 0)
            message = r_data.get('message', '')
            print('File md5: %s, Status Code: %d, Submission Info: %s' % (file_sha1, status, message))
        else:
            label = r_data.get('data', {}).get('label', 'unknown')
            sim_hash = r_data.get('data', {}).get('similars', [])[0].get('hash', 'None')
            sim = r_data.get('data', {}).get('similars', [])[0].get('sim', 0.0)
            check_result = ''
            # Hidden statistical condition: the actual label must match the
            # predicted label for a sample to be counted as TP (a correct classification)
            if tag == label and sim > dlt:
                TP += 1
                check_result = 'Correct'
            elif tag == 'clean' and sim > dlt:
                FP += 1
                check_result = 'Incorrect'
            elif tag == 'clean' and sim <= dlt:
                TN += 1
                check_result = 'Correct'
            else:
                FN += 1
                check_result = 'Incorrect'
            print('%s,%s,%s,%s,%s,%.5f' % (check_result, file_sha1, tag, sim_hash, label, sim))
        if count >= 500:
            break
    print('Total mmdthash tested: %d' % count)
    print('Total correct detections: %d' % (TP + TN))
    print('Total incorrect detections: %d' % (FP + FN))
    print('Total TP detected: %d' % TP)
    print('Total TN detected: %d' % TN)
    print('Total FP detected: %d' % FP)
    print('Total FN detected: %d' % FN)
    print('Detection accuracy (ACC): %.3f' % ((TP + TN) / (TP + TN + FP + FN)))
    print('Detection precision (PRE): %.3f' % (TP / (TP + FP)))
    print('Detection recall (REC): %.3f' % (TP / (TP + FN)))

def main():
    mmdt_scan_online_check()

if __name__ == '__main__':
    main()
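Assuming the script above is saved as mmdt_scan_check.py (the file name is hypothetical), a run against the 400-entry dataset looks like:

python mmdt_scan_check.py mmdt_feature_test_400.data

The script prints one CSV-style line per sample, then the TP/TN/FP/FN tallies and the computed ACC, PRE, and REC; changing dlt to 0.90 reproduces the statistics for the second threshold.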

The test results are as follows:

[Figure: detection statistics at the 0.90 and 0.95 thresholds]
Conclusions are as follows:
  • The accuracy (ACC) of KNN model detection is above 80%, and the recall (REC) is above 60%

  • The accuracy (ACC) at the 0.95 threshold is 3.68‰ higher than at the 0.90 threshold, reaching 81.8%

  • The precision (PRE) at the 0.95 threshold is 83.42‰ higher than at the 0.90 threshold, reaching 100%

  • The recall (REC) at the 0.95 threshold is 72.99‰ lower than at the 0.90 threshold, at 63.5%

  • At the 0.95 threshold, no false positives were observed (FP = 0)

  • In summary, 0.95 can be set as the initial determination threshold: results with similarity of at least 0.95 can be preliminarily judged credible, and results below 0.95 preliminarily judged not credible

Sample Analysis of Classification Errors

1. Analysis of errors at the 0.95 threshold
[Figure: misclassified samples at the 0.95 threshold]
  • Errors with a similarity of 1: gandcrypt and gandcrab are different names for the same malicious family, so these detections are actually correct;

  • Errors with a similarity above 0.99 and below 1: caused by mislabeling of the agent sample; a VT query shows the sample is actually klez, so the detection is correct;

  • Errors with a similarity above 0.95 and below 0.99: the two files differ significantly; this is an mmdthash issue, and the detections are confirmed incorrect.

2. Analysis of errors at the 0.90 threshold
[Figure: misclassified samples at the 0.90 threshold]
  • Errors with a similarity above 0.95: the same classification errors as at the 0.95 threshold;

  • Errors with a similarity above 0.94 and below 0.95: all confirmed incorrect, for example UPX-packed files associated with Delphi-compiled files, NSIS files with Word files, and PE files with Excel files;

  • Errors with a similarity below 0.94: the vast majority of detections are incorrect, for many different reasons, with a small number of correct results mixed in.

Notes

1. In the future, the number of mmdthash records in the KNN model can be expanded for further testing;
2. While analyzing the detection results, we found that some families, and samples from certain periods, produce identical mmdthash values, so the KNN model's most similar sample is almost always the same one. In such cases the base library can be filtered to reduce the record count, shrink the model, speed up queries, and lower memory usage; a minimal filtering sketch follows.
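A minimal sketch of that filtering idea, assuming the same record layout as in the test script above (mmdt:mmdt:label:sha1), keeps one record per distinct mmdthash value:

# -*- coding: utf-8 -*-
# Sketch: deduplicate a feature set by mmdthash value, keeping the first
# record seen for each hash. Assumes the mmdt:mmdt:label:sha1 record layout.
from python_mmdt.mmdt.common import mmdt_load

def dedup_features(data_file):
    seen = set()
    kept = []
    for feature in mmdt_load(data_file):
        file_mmdt = ':'.join(feature.strip().split(":")[:2])
        if file_mmdt not in seen:
            seen.add(file_mmdt)
            kept.append(feature)
    return kept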


Kexue ID: 大大薇薇

https://bbs.pediy.com/user-home-467421.htm

*This article is an original work by 大大薇薇 of the Kexue Forum. Please credit the Kexue Community when reprinting.

