This is a featured article from the KX forum; author ID: 大大薇薇
Review of Previous Articles
python_mmdt: A Python Library for Generating Feature Vectors Based on Sensitive Hashing (Part 1)
(https://bbs.pediy.com/thread-265211.htm)
We introduced a method called mmdthash (sensitive hashing) and provided a basic introduction to its concepts.
python_mmdt: From 0 to 1 – Implementing a Simple Malware Classifier (Part 2)
(https://bbs.pediy.com/thread-265499.htm)
We introduced a simple malware classifier application based on mmdthash.
python_mmdt: From 1 to 2 – Implementing a KNN-based Machine Learning Malware Classifier (Part 3)
(https://bbs.pediy.com/thread-265860.htm)
We introduced a machine learning malware classifier application based on mmdthash.
python_mmdt: Online Use of mmdthash (Part 4)
(https://bbs.pediy.com/thread-271243.htm)
We introduced how to use mmdthash for online malicious file detection.
python_mmdt: KNN Machine Learning Classification Results Testing and Analysis (Part 5)
(https://bbs.pediy.com/thread-271265.htm)
We conducted statistical testing on the classification results of the KNN machine learning algorithm to evaluate the classification model.
This article compares the effectiveness of four sensitive hashing algorithms: ssdeep, tlsh, vhash, and mmdthash.
Project Address
GitHub code address: python_mmdt (https://github.com/a232319779/python_mmdt)
Comparison Conclusions
Accuracy ACC: tlsh > mmdthash > ssdeep > vhash
Recall REC: tlsh > mmdthash > ssdeep > vhash
Precision PRE: mmdthash = ssdeep = vhash > tlsh
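For reference, ACC, REC, and PRE here are the standard confusion-matrix ratios over pairwise similarity judgments. A minimal sketch of the definitions (my notation, not code from the project):

# tp/fp/tn/fn count how pairwise "similar / not similar" judgments
# line up with the ground truth
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + tn + fp + fn)  # ACC: share of all judgments that are correct

def recall(tp, fn):
    return tp / (tp + fn)                   # REC: share of truly similar pairs that are found

def precision(tp, fp):
    return tp / (tp + fp)                   # PRE: share of reported pairs that are truly similar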
Introduction to Sensitive Hashing
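As covered in Part 1 of this series, a sensitive hash (fuzzy hash) maps similar inputs to similar digests, so the distance between two hashes approximates the similarity of the underlying files. ssdeep, tlsh, and mmdthash all expose such a distance; vhash, VirusTotal's in-house similarity hash, is normally compared for exact equality.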
Comparison Approach
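In short: compute all four hashes over the same sample set, score the same candidate file pairs with each algorithm, and compare the resulting accuracy, recall, and precision.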
Comparison Process
1. ssdeep Calculation
# -*- coding: utf-8 -*-
import os
import sys
import hashlib
import json
import ssdeep

# Traverse directory
def list_dir(root_dir):
    files = os.listdir(root_dir)
    for f in files:
        file_path = os.path.join(root_dir, f)
        yield file_path

# Generate sha1
def gen_sha1(file_name):
    with open(file_name, 'rb') as f:
        s = f.read()
    _s = hashlib.sha1()
    _s.update(s)
    return _s.hexdigest()

def main():
    # Input: the path of the 785 files
    file_path = sys.argv[1]
    ssdeep_dict = dict()
    for file_name in list_dir(file_path):
        file_sha1 = gen_sha1(file_name)
        ssdeep_hash = ssdeep.hash_from_file(file_name)
        print('%s,%s' % (file_sha1, ssdeep_hash))
        ssdeep_dict[file_sha1] = ssdeep_hash
    # Save results to a JSON file
    with open('ssdeep_test.json', 'w') as f:
        f.write(json.dumps(ssdeep_dict, indent=4))

if __name__ == '__main__':
    main()
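Assuming the script above is saved as ssdeep_test.py (the name is mine), run it against the sample directory and inspect the output:

python ssdeep_test.py ./samples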
cat ssdeep_test.json
{
    "0ec279513e9e8a0e8f6e7c170b9462b60d9888c6": "6144:w9qaZ5E6fCvH5H42SUiTV2MTb54y94HTFboTWhmzeOws:w9d96yeKV2MTb5X4zZQWhmqd",
    "0ad6db9128353742b3d4c8a5fc1993ca8bf399f1": "1536:NxiIXeGNc0BL0IFx34bPMkG/KsrKlEqjjPWUJ7h/dbZkv13t43O:eIXeGNtV0KIQjr5ehlbSv13t43O",
    "e3dc592a0fa552beb35ebcb4160e5e4cb4686f17": "1536:qKXppRU0D2KmMESllkQSp5jcUyT/jAdp/hsonBqar5mVNCG:JpGjKm9fQSp5sjAfAa1mVMG",
    "c8e1100b1e38e5c5e671a23cd49d98e315b74a36": "3072:XwZcFNCpegr+L3Y5D+LRohyOBGbNc8GMmE/A9VpGLGWtQeGwX1gnuZPZc2:XHCNEY5D+LfOi3GbE/AsAeGwXwc5",
    "0ae0cba5b411541cc8d9f94e01151fec9d6b9242": "384:enXKs1aOcWkZ1WgoELXuf9OO5GD+IGA4p1XMWfg7CF:enp1aOasDOOM+ut",
    ......
}
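Two ssdeep digests can be scored directly from these strings. A quick check using the first two entries above (ssdeep.compare returns an integer match score from 0 to 100, higher meaning more similar):

import ssdeep

h1 = "6144:w9qaZ5E6fCvH5H42SUiTV2MTb54y94HTFboTWhmzeOws:w9d96yeKV2MTb5X4zZQWhmqd"
h2 = "1536:NxiIXeGNc0BL0IFx34bPMkG/KsrKlEqjjPWUJ7h/dbZkv13t43O:eIXeGNtV0KIQjr5ehlbSv13t43O"
print(ssdeep.compare(h1, h2))  # 0-100; unrelated samples typically score 0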
2. tlsh Calculation
# -*- coding: utf-8 -*-
import os
import sys
import hashlib
import json
import tlsh

# Traverse directory
def list_dir(root_dir):
    files = os.listdir(root_dir)
    for f in files:
        file_path = os.path.join(root_dir, f)
        yield file_path

# Generate sha1
def gen_sha1(file_name):
    with open(file_name, 'rb') as f:
        s = f.read()
    _s = hashlib.sha1()
    _s.update(s)
    return _s.hexdigest()

def gen_tlsh(file_name):
    with open(file_name, 'rb') as f:
        s = f.read()
    _s = tlsh.hash(s)
    return _s

def main():
    # Input: the path of the 785 files
    file_path = sys.argv[1]
    tlsh_dict = dict()
    for file_name in list_dir(file_path):
        file_sha1 = gen_sha1(file_name)
        tlsh_hash = gen_tlsh(file_name)
        print('%s,%s' % (file_sha1, tlsh_hash))
        tlsh_dict[file_sha1] = tlsh_hash
    with open('tlsh_test.json', 'w') as f:
        f.write(json.dumps(tlsh_dict, indent=4))

if __name__ == '__main__':
    main()
cat tlsh_test.json
{
    "0ec279513e9e8a0e8f6e7c170b9462b60d9888c6": "T1616423D5248C5DF8E251CCF4C73AB60493EADA48BF516B75BDD9C2692FF2480C93A214",
    "0ad6db9128353742b3d4c8a5fc1993ca8bf399f1": "T13D73024483EBEDA8EE040AB0124C43B9CBAD8D1B7659653DFD3864D1FC064AE47269A6",
    "e3dc592a0fa552beb35ebcb4160e5e4cb4686f17": "T1CF93293D766924E5E139C17CC5474E0AF772B025071227EF06A4C2BE1F97BE06C39AA5",
    "c8e1100b1e38e5c5e671a23cd49d98e315b74a36": "T17F34391A57EC0465F1B7923589B34919F233B8625731E2DF109082BC2E27FD8BE36B56",
    "0ae0cba5b411541cc8d9f94e01151fec9d6b9242": "T12D5208C71F69F7D4C19F85F84A3B623E1EA4616A6111412057DD3E92BC1C3DBFA2A09C",
    ......
}
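Unlike ssdeep, tlsh.diff returns a distance rather than a similarity: 0 means identical digests, and larger values mean less similar, with no fixed upper bound. A quick check using the first two entries above (the comparison script later rescales this with 1 - score/1160.0):

import tlsh

t1 = "T1616423D5248C5DF8E251CCF4C73AB60493EADA48BF516B75BDD9C2692FF2480C93A214"
t2 = "T13D73024483EBEDA8EE040AB0124C43B9CBAD8D1B7659653DFD3864D1FC064AE47269A6"
print(tlsh.diff(t1, t2))  # a distance, not a similarity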
3. vhash Calculation
# -*- coding: utf-8 -*-
import sys
import json
import requests
from time import sleep

# VirusTotal API key
x_apikey = 'xxxx'

def read_hash(file_name):
    with open(file_name, 'r') as f:
        datas = f.readlines()
    return [file_hash.strip() for file_hash in datas]

def parse_vt_report(vt_report_json):
    attributes = vt_report_json.get('data', {}).get('attributes', {})
    parse_data = dict()
    if attributes:
        # Record the file's ssdeep/tlsh/vhash/file type
        parse_data['vhash'] = attributes.get('vhash', '')
        parse_data['magic'] = attributes.get('magic', '')
        parse_data['tlsh'] = attributes.get('tlsh', '')
        parse_data['ssdeep'] = attributes.get('ssdeep', '')
    return parse_data

def vt_search(sha1_hash):
    url = "https://www.virustotal.com/api/v3/files/{}".format(sha1_hash)
    headers = {
        "Accept": "application/json",
        "x-apikey": x_apikey
    }
    response = requests.request("GET", url, headers=headers)
    # Default to an empty dict so a failed lookup still returns cleanly
    parse_data = dict()
    try:
        parse_data = parse_vt_report(response.json())
    except Exception as e:
        print('error: %s, reason: %s' % (sha1_hash, str(e)))
    return parse_data

def main():
    # Path of the file containing the hashes to query
    file_path = sys.argv[1]
    vhash_dict = dict()
    file_hashs = read_hash(file_path)
    for file_hash in file_hashs:
        parse_data = vt_search(file_hash)
        print('%s,%s' % (file_hash, json.dumps(parse_data)))
        if parse_data:
            vhash_dict[file_hash] = parse_data
        else:
            break
        sleep(1)
    with open('vhash_test.json', 'w') as f:
        f.write(json.dumps(vhash_dict, indent=4))

if __name__ == '__main__':
    main()
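Assuming the script is saved as vhash_test.py (the name is mine), it takes a text file with one SHA1 per line. The sleep(1) call paces the requests; slow it down if your VirusTotal API quota requires:

python vhash_test.py sha1_list.txt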
cat vhash_test.json
{
    "aba1301af627506cf67fd61410800b37c973dcb6": {
        "vhash": "1240451d05151\"z",
        "magic": "PE32+ executable for MS Windows (DLL) (console)",
        "tlsh": "T151B22A828BB81403FA767D7013A8D6837D3D67D60820856915AAF5AA2C833C5EF10F7E",
        "ssdeep": "192:8fPNlWZYWfUyfUlHDBQABJB3ejpC52qnaj68tj:iNlWZYW+DBRJ4Nle8tj"
    },
    "5f3ebf2c443f7010d3a5c2e5fa77c62b03ca1279": {
        "vhash": "1240451d05151\"z",
        "magic": "PE32+ executable for MS Windows (DLL) (console)",
        "tlsh": "T140B239D6CBBC0547E9663EB0124A8E9873D3E73EB4820416905A5F1981C837C5EF00F6E",
        "ssdeep": "192:8Ih6WxwWFUyfUlHDBQABJj1N80Hy5qnajWi8sA+F:Vh6WxwW0DBRJjPsl+yF"
    },
    "3d57ce2f5149f1d9609608bc732d86637fe20cce": {
        "vhash": "1240451d05151\"z",
        "magic": "PE32+ executable for MS Windows (DLL) (console)",
        "tlsh": "T18FB23AC2CBEC5443EAA67A7043A8E58B7D3DB3D21C60855904A6E1591CD33C2EF24E7E",
        "ssdeep": "192:8JWhOMrlWBwWYUyfUlHDBQABJ5cWvKxEHsqnajTT0f7:kWhOMRWBwWhDBRJNKxUsl3TM"
    },
    ......
}
4. mmdthash Calculation
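The mmdthash values were generated with the python_mmdt tooling covered in Parts 1-3 of this series. For completeness, here is a minimal sketch in the same shape as the ssdeep/tlsh scripts; the MMDT class and its mmdt_hash method reflect my reading of the python_mmdt README, so treat the exact API as an assumption:

# -*- coding: utf-8 -*-
import os
import sys
import hashlib
import json
from python_mmdt.mmdt.mmdt import MMDT  # import path assumed from the README

# Traverse directory
def list_dir(root_dir):
    for f in os.listdir(root_dir):
        yield os.path.join(root_dir, f)

# Generate sha1
def gen_sha1(file_name):
    with open(file_name, 'rb') as f:
        s = f.read()
    _s = hashlib.sha1()
    _s.update(s)
    return _s.hexdigest()

def main():
    file_path = sys.argv[1]
    mmdt = MMDT()
    mmdt_dict = dict()
    for file_name in list_dir(file_path):
        file_sha1 = gen_sha1(file_name)
        mmdt_hash = mmdt.mmdt_hash(file_name)  # method name assumed
        print('%s,%s' % (file_sha1, mmdt_hash))
        mmdt_dict[file_sha1] = mmdt_hash
    with open('mmdt_test.json', 'w') as f:
        f.write(json.dumps(mmdt_dict, indent=4))

if __name__ == '__main__':
    main()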
Result Comparison
1. Integration of Results from ssdeep, tlsh, vhash, and mmdthash
{ "aba1301af627506cf67fd61410800b37c973dcb6": { "vhash": "1240451d05151\"z", "magic": "PE32+ executable for MS Windows (DLL) (console)", "tlsh": "T151B22A828BB81403FA767D7013A8D6837D3D67D60820856915AAF5AA2C833C5EF10F7E", "ssdeep": "192:8fPNlWZYWfUyfUlHDBQABJB3ejpC52qnaj68tj:iNlWZYW+DBRJ4Nle8tj", "mmdthash": "07022B59:7202890402200212DA032EC310AFEF8A" }, "5f3ebf2c443f7010d3a5c2e5fa77c62b03ca1279": { "vhash": "1240451d05151\"z", "magic": "PE32+ executable for MS Windows (DLL) (console)", "tlsh": "T140B239D6CBBC0547E9663EB0124A8E9873D3E73EB4820416905A5F1981C837C5EF00F6E", "ssdeep": "192:8Ih6WxwWFUyfUlHDBQABJj1N80Hy5qnajWi8sA+F:Vh6WxwW0DBRJjPsl+yF", "mmdthash": "07022B59:7102870402200212DD032DC30EA0F1A9" }, ......}
2. Sensitive Hash Similarity Calculation
# -*- coding: utf-8 -*-
import json
import ssdeep
import tlsh

def read_hash(file_name):
    with open(file_name, 'r') as f:
        datas = f.readlines()
    return [file_hash.strip() for file_hash in datas]

def ssdeep_compare(data1, data2):
    h1 = data1.get('ssdeep', '')
    h2 = data2.get('ssdeep', '')
    score = ssdeep.compare(h1, h2)
    # ssdeep.compare returns a match score in [0, 100]
    return score / 100.0

def tlsh_compare(data1, data2):
    h1 = data1.get('tlsh', '')
    h2 = data2.get('tlsh', '')
    score = tlsh.diff(h1, h2)
    # tlsh.diff returns a distance (0 = identical); treat 1160 as the
    # maximum distance and rescale to a [0, 1] similarity
    return 1 - score / 1160.0

def vhash_compare(data1, data2):
    # vhash has no distance function: identical values mean the same
    # cluster, so score equality only
    h1 = data1.get('vhash', '')
    h2 = data2.get('vhash', '')
    score = 1.0 if h1 == h2 else 0.0
    return score

def main():
    mmdt_hash_sim = read_hash('./mmdt_sim.csv')
    with open('./ssdeep_tlsh_vhash_mmdthash_test.json', 'r') as f:
        vhash_json = json.loads(f.read())
    print('Original File, Similar File, mmdt Similarity, ssdeep Similarity, tlsh Similarity, vhash Similarity, Original File Type, Similar File Type')
    for mhs in mmdt_hash_sim:
        tmp = mhs.split(',')
        ori_hash = tmp[0]
        sim_hash = tmp[1]
        mmdt_sim = float(tmp[2])
        ori_data = vhash_json[ori_hash]
        sim_data = vhash_json[sim_hash]
        ssdeep_sim = ssdeep_compare(ori_data, sim_data)
        tlsh_sim = tlsh_compare(ori_data, sim_data)
        vhash_sim = vhash_compare(ori_data, sim_data)
        ori_type = ori_data.get('magic', '').split(' ')[0]
        sim_type = sim_data.get('magic', '').split(' ')[0]
        print('%s,%s,%.3f,%.3f,%.3f,%.3f,%s,%s' % (
            ori_hash, sim_hash, mmdt_sim, ssdeep_sim, tlsh_sim, vhash_sim, ori_type, sim_type
        ))

if __name__ == '__main__':
    main()
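Assuming the script is saved as compare_sim.py (the name is mine), it expects mmdt_sim.csv (one "ori_hash,sim_hash,score" line per mmdthash pair, produced with the python_mmdt tooling) alongside the merged JSON; redirect stdout to collect the comparison table:

python compare_sim.py > sim_result.csv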

3. Result Analysis
tlsh has the highest accuracy and recall, but also the highest false positive rate (false positive rate = 1.0 - precision)
mmdthash has the second-highest accuracy and recall: accuracy 1.5% below tlsh, recall 9.0% below tlsh, and a false positive rate 5.5% below tlsh
ssdeep is third: accuracy 6.8% below tlsh, recall 21.4% below tlsh, and a false positive rate 5.5% below tlsh
vhash is a special case: its comparison is all-or-nothing (identical vhash values or not), so its numbers are best read directly from the data rather than ranked
Others
1. Distribution of 400 Test File Types
PE files account for 96%
ELF files account for 2%
Other files account for 2%
2. Calculation Time for 400 Test Files
Reflections and Gains
① The ssdeep paper is very well written: easy to follow, with a clear line of argument and solid mathematical grounding. Excellent work.
② The tlsh ecosystem is remarkably mature: six related papers and six conference presentations speak to how well it works in practice.
③ While implementing mmdthash I worked in too much of a bubble; many of my half-formed ideas had already been discussed clearly and thoroughly in the tlsh papers, and the corresponding applications are already quite mature.
④ The performance of mmdthash needs continued optimization.
KX ID: 大大薇薇
https://bbs.pediy.com/user-home-467421.htm