Comparison of ssdeep, tlsh, vhash, and mmdthash

This article is a featured post from the KX forum; author ID: 大大薇薇

Review of Previous Articles

python_mmdt: A Python Library for Generating Feature Vectors Based on Sensitive Hashing (Part 1)

(https://bbs.pediy.com/thread-265211.htm)

We introduced a method called mmdthash (sensitive hashing) and provided a basic introduction to its concepts.

python_mmdt: From 0 to 1 – Implementing a Simple Malware Classifier (Part 2)

(https://bbs.pediy.com/thread-265499.htm)

We introduced a simple malware classifier application based on mmdthash.

python_mmdt: From 1 to 2 – Implementing a KNN-based Machine Learning Malware Classifier (Part 3)

(https://bbs.pediy.com/thread-265860.htm)

We introduced a machine learning malware classifier application based on mmdthash.

python_mmdt: Online Use of mmdthash (Part 4)

(https://bbs.pediy.com/thread-271243.htm)

We introduced how to use mmdthash for online malicious file detection.

python_mmdt: KNN Machine Learning Classification Results Testing and Analysis (Part 5)

(https://bbs.pediy.com/thread-271265.htm)

We conducted statistical testing on the classification results of the KNN machine learning algorithm to evaluate the classification model.

This article compares the effectiveness of four types of sensitive hashing algorithms: ssdeep, tlsh, vhash, and mmdthash.

Project Address

GitHub repository: python_mmdt (https://github.com/a232319779/python_mmdt)

Comparison Conclusions

Accuracy ACC: tlsh > mmdthash > ssdeep > vhash

Recall REC: tlsh > mmdthash > ssdeep > vhash

Precision PRE: mmdthash = ssdeep = vhash > tlsh


Based on the test results of this article, with the mmdthash threshold set at 0.95, ssdeep at 0.8, and tlsh at 0.8, the comprehensive ranking of sensitive hashing effectiveness is as follows:
tlsh > mmdthash > ssdeep > vhash
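Expressed as code, the detection rule used throughout this comparison looks like the sketch below (the thresholds are the ones chosen in this article; `THRESHOLDS` and `is_detected` are illustrative names, and vhash is treated as exact-match only):

```python
# Per-algorithm similarity thresholds used in this article's tests
THRESHOLDS = {'mmdthash': 0.95, 'ssdeep': 0.8, 'tlsh': 0.8, 'vhash': 1.0}

def is_detected(algorithm: str, similarity: float) -> bool:
    """A sample pair counts as 'related' (detected) when its normalized
    similarity reaches the algorithm's threshold."""
    return similarity >= THRESHOLDS[algorithm]
```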

Introduction to Sensitive Hashing

Overview of Four Types of Sensitive Hashing:
CTPH(ssdeep)(https://ssdeep-project.github.io/ssdeep/index.html):
Context Triggered Piecewise Hashing (CTPH), also known as fuzzy hashing, was first proposed by Dr. Jesse Kornblum in 2006; the paper is available via the project site. CTPH can be used to determine the provenance of files/data. According to the official documentation, its computation speed is twice that of tlsh (our testing, however, suggests otherwise).
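To illustrate the "context trigger" idea, here is a toy Python sketch. It is NOT real ssdeep: the real algorithm uses a specific rolling hash and a base64 alphabet, while the `block_size` and `window` parameters here are arbitrary choices for illustration only.

```python
import hashlib

def ctph_sketch(data: bytes, block_size: int = 16, window: int = 7) -> str:
    """Toy context-triggered piecewise hash, illustrating only the
    trigger-and-piece idea behind CTPH/ssdeep."""
    signature = []
    piece_start = 0
    rolling = 0  # sum of the last `window` bytes, a stand-in rolling hash
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]
        # Context trigger: when the rolling value hits a reset point,
        # close the current piece and emit one character for it.
        if rolling % block_size == block_size - 1:
            piece = data[piece_start:i + 1]
            signature.append('0123456789abcdef'[hashlib.md5(piece).digest()[0] % 16])
            piece_start = i + 1
    return ''.join(signature)
```

Because piece boundaries depend only on local context, a change near the end of the data leaves the earlier pieces, and therefore the early signature characters, untouched; this locality is what makes piecewise hashes comparable.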
tlsh(https://tlsh.org/index.html):
tlsh is an open-source fuzzy hash tool developed by Trend Micro. It computes a hash value for any data of 50 bytes or more and determines the provenance relationship between files by scoring the similarity of their hash values. According to the official documentation, tlsh is harder to attack and bypass than other fuzzy hashing algorithms such as ssdeep and sdhash.
vhash(https://developers.virustotal.com/reference/files):
Searching the entire VirusTotal documentation turns up only a single statement: “an in-house similarity clustering algorithm value, based on a simple structural feature hash allows you to find similar files” — that is, an internal similarity-clustering value, computed from a simple structural feature hash, that can be used to find similar samples.
mmdthash(https://github.com/a232319779/python_mmdt):
It is an open-source fuzzy hashing computation tool that generates a fuzzy hash value for any data and determines the correlation between two data sets by calculating the similarity of fuzzy hash values. More details can be found in the previous articles 1-5.

Comparison Approach

Based on the mmdthash test data and results from the article python_mmdt: KNN Machine Learning Classification Results Testing and Analysis (Part 5), we conducted comparative tests on ssdeep, tlsh, and vhash. This involved calculating the similarity of ssdeep, tlsh, and vhash for two samples associated with mmdthash, and statistically analyzing related outliers, accuracy, recall, and precision to derive comparison results between sensitive hashing algorithms.

Comparison Process

Note: The installation of ssdeep and tlsh on Windows can be quite troublesome, so the testing was conducted directly on a Raspberry Pi Linux environment.

1. ssdeep Calculation

ssdeep Installation:
Install the fuzzy hashing library: sudo apt-get -y install libfuzzy-dev ssdeep
Install the ssdeep Python library: pip install ssdeep
Alternatively, if the Linux compilation environment is complete (including tools like automake), you can install the ssdeep fuzzy hashing library directly via pip: BUILD_LIB=1 pip install ssdeep
The following code uses Python’s ssdeep library to calculate the ssdeep values of the 785 test files and save them in JSON format:
# -*- coding: utf-8 -*-
import os
import sys
import hashlib
import json
import ssdeep

# Traverse directory
def list_dir(root_dir):
    files = os.listdir(root_dir)
    for f in files:
        file_path = os.path.join(root_dir, f)
        yield file_path

# Generate sha1
def gen_sha1(file_name):
    with open(file_name, 'rb') as f:
        s = f.read()
        _s = hashlib.sha1()
        _s.update(s)
        return _s.hexdigest()

def main():
    # Input the paths of 785 files
    file_path = sys.argv[1]
    ssdeep_dict = dict()
    for file_name in list_dir(file_path):
        file_sha1 = gen_sha1(file_name)
        ssdeep_hash = ssdeep.hash_from_file(file_name)
        print('%s,%s' % (file_sha1, ssdeep_hash))
        ssdeep_dict[file_sha1] = ssdeep_hash
    # Save results in JSON file
    with open('ssdeep_test.json', 'w') as f:
        f.write(json.dumps(ssdeep_dict, indent=4))

if __name__ == '__main__':
    main()
Example of ssdeep calculation results:
cat ssdeep_test.json
{
    "0ec279513e9e8a0e8f6e7c170b9462b60d9888c6": "6144:w9qaZ5E6fCvH5H42SUiTV2MTb54y94HTFboTWhmzeOws:w9d96yeKV2MTb5X4zZQWhmqd",
    "0ad6db9128353742b3d4c8a5fc1993ca8bf399f1": "1536:NxiIXeGNc0BL0IFx34bPMkG/KsrKlEqjjPWUJ7h/dbZkv13t43O:eIXeGNtV0KIQjr5ehlbSv13t43O",
    "e3dc592a0fa552beb35ebcb4160e5e4cb4686f17": "1536:qKXppRU0D2KmMESllkQSp5jcUyT/jAdp/hsonBqar5mVNCG:JpGjKm9fQSp5sjAfAa1mVMG",
    "c8e1100b1e38e5c5e671a23cd49d98e315b74a36": "3072:XwZcFNCpegr+L3Y5D+LRohyOBGbNc8GMmE/A9VpGLGWtQeGwX1gnuZPZc2:XHCNEY5D+LfOi3GbE/AsAeGwXwc5",
    "0ae0cba5b411541cc8d9f94e01151fec9d6b9242": "384:enXKs1aOcWkZ1WgoELXuf9OO5GD+IGA4p1XMWfg7CF:enp1aOasDOOM+ut",
    ......
}

2. tlsh Calculation

tlsh Installation: pip install py-tlsh
The following code uses Python’s tlsh library to calculate the tlsh values of the 785 test files and save them in JSON format:
# -*- coding: utf-8 -*-
import os
import sys
import hashlib
import json
import tlsh

# Traverse directory
def list_dir(root_dir):
    files = os.listdir(root_dir)
    for f in files:
        file_path = os.path.join(root_dir, f)
        yield file_path

# Generate sha1
def gen_sha1(file_name):
    with open(file_name, 'rb') as f:
        s = f.read()
        _s = hashlib.sha1()
        _s.update(s)
        return _s.hexdigest()

def gen_tlsh(file_name):
    with open(file_name, 'rb') as f:
        s = f.read()
        _s = tlsh.hash(s)
        return _s

def main():
    # Input the paths of 785 files
    file_path = sys.argv[1]
    tlsh_dict = dict()
    for file_name in list_dir(file_path):
        file_sha1 = gen_sha1(file_name)
        tlsh_hash = gen_tlsh(file_name)
        print('%s,%s' % (file_sha1, tlsh_hash))
        tlsh_dict[file_sha1] = tlsh_hash
    with open('tlsh_test.json', 'w') as f:
        f.write(json.dumps(tlsh_dict, indent=4))

if __name__ == '__main__':
    main()
Example of tlsh calculation results:
cat tlsh_test.json
{
    "0ec279513e9e8a0e8f6e7c170b9462b60d9888c6": "T1616423D5248C5DF8E251CCF4C73AB60493EADA48BF516B75BDD9C2692FF2480C93A214",
    "0ad6db9128353742b3d4c8a5fc1993ca8bf399f1": "T13D73024483EBEDA8EE040AB0124C43B9CBAD8D1B7659653DFD3864D1FC064AE47269A6",
    "e3dc592a0fa552beb35ebcb4160e5e4cb4686f17": "T1CF93293D766924E5E139C17CC5474E0AF772B025071227EF06A4C2BE1F97BE06C39AA5",
    "c8e1100b1e38e5c5e671a23cd49d98e315b74a36": "T17F34391A57EC0465F1B7923589B34919F233B8625731E2DF109082BC2E27FD8BE36B56",
    "0ae0cba5b411541cc8d9f94e01151fec9d6b9242": "T12D5208C71F69F7D4C19F85F84A3B623E1EA4616A6111412057DD3E92BC1C3DBFA2A09C",
    ......
}

3. vhash Calculation

VirusTotal does not provide an open-source implementation of vhash, so it can only be queried via the VirusTotal web API. The API also returns a file's ssdeep and tlsh values (some older samples on VirusTotal appear to lack tlsh values). The VirusTotal API documentation (https://developers.virustotal.com/reference/file-info) lets you exercise the API directly in the page and generate client code in various languages, which is very convenient.
Register a VirusTotal account and apply for an api_key as described in the documentation; this only takes a few minutes. When querying via the API, be mindful of the rate limits.
The following Python code queries VirusTotal:
# -*- coding: utf-8 -*-
import sys
import json
import requests
from time import sleep

# virustotal api key
x_apikey = 'xxxx'

def read_hash(file_name):
    with open(file_name, 'r') as f:
        datas = f.readlines()
        return [file_hash.strip() for file_hash in datas]

def parse_vt_report(vt_report_json):
    attributes = vt_report_json.get('data', {}).get('attributes', {})
    parse_data = dict()
    if attributes:
        # Record the file's ssdeep/tlsh/vhash/file type
        parse_data['vhash'] = attributes.get('vhash', '')
        parse_data['magic'] = attributes.get('magic', '')
        parse_data['tlsh'] = attributes.get('tlsh', '')
        parse_data['ssdeep'] = attributes.get('ssdeep', '')
    return parse_data

def vt_search(sha1_hash):
    url = "https://www.virustotal.com/api/v3/files/{}".format(sha1_hash)
    headers = {
        "Accept": "application/json",
        "x-apikey": x_apikey
    }
    response = requests.request("GET", url, headers=headers)
    parse_data = dict()  # ensure parse_data is defined even if parsing fails
    try:
        parse_data = parse_vt_report(response.json())
    except Exception as e:
        print('error: %s, reason: %s' % (sha1_hash, str(e)))
    return parse_data

def main():
    # Path containing the hashes to query
    file_path = sys.argv[1]
    vhash_dict = dict()
    file_hashs = read_hash(file_path)
    for file_hash in file_hashs:
        parse_data = vt_search(file_hash)
        print('%s,%s' % (file_hash, json.dumps(parse_data)))
        if parse_data:
            vhash_dict[file_hash] = parse_data
        else:
            break
        sleep(1)
    with open('vhash_test.json', 'w') as f:
        f.write(json.dumps(vhash_dict, indent=4))

if __name__ == '__main__':
    main()
Example of vhash query results:
cat vhash_test.json
{
    "aba1301af627506cf67fd61410800b37c973dcb6": {
        "vhash": "1240451d05151\"z",
        "magic": "PE32+ executable for MS Windows (DLL) (console)",
        "tlsh": "T151B22A828BB81403FA767D7013A8D6837D3D67D60820856915AAF5AA2C833C5EF10F7E",
        "ssdeep": "192:8fPNlWZYWfUyfUlHDBQABJB3ejpC52qnaj68tj:iNlWZYW+DBRJ4Nle8tj"
    },
    "5f3ebf2c443f7010d3a5c2e5fa77c62b03ca1279": {
        "vhash": "1240451d05151\"z",
        "magic": "PE32+ executable for MS Windows (DLL) (console)",
        "tlsh": "T140B239D6CBBC0547E9663EB0124A8E9873D3E73EB4820416905A5F1981C837C5EF00F6E",
        "ssdeep": "192:8Ih6WxwWFUyfUlHDBQABJj1N80Hy5qnajWi8sA+F:Vh6WxwW0DBRJjPsl+yF"
    },
    "3d57ce2f5149f1d9609608bc732d86637fe20cce": {
        "vhash": "1240451d05151\"z",
        "magic": "PE32+ executable for MS Windows (DLL) (console)",
        "tlsh": "T18FB23AC2CBEC5443EAA67A7043A8E58B7D3DB3D21C60855904A6E1591CD33C2EF24E7E",
        "ssdeep": "192:8JWhOMrlWBwWYUyfUlHDBQABJ5cWvKxEHsqnajTT0f7:kWhOMRWBwWhDBRJNKxUsl3TM"
    },
    ......
}

4. mmdthash Calculation

Using the test results from python_mmdt: KNN Machine Learning Classification Results Testing and Analysis (Part 5).

Result Comparison

1. Integration of Results from ssdeep, tlsh, vhash, and mmdthash

The data generated from the ssdeep_test.json, tlsh_test.json, vhash_test.json, and mmdthash_test.json files are integrated into the ssdeep_tlsh_vhash_mmdthash_test.json file in dictionary format, as shown below:
{
    "aba1301af627506cf67fd61410800b37c973dcb6": {
        "vhash": "1240451d05151\"z",
        "magic": "PE32+ executable for MS Windows (DLL) (console)",
        "tlsh": "T151B22A828BB81403FA767D7013A8D6837D3D67D60820856915AAF5AA2C833C5EF10F7E",
        "ssdeep": "192:8fPNlWZYWfUyfUlHDBQABJB3ejpC52qnaj68tj:iNlWZYW+DBRJ4Nle8tj",
        "mmdthash": "07022B59:7202890402200212DA032EC310AFEF8A"
    },
    "5f3ebf2c443f7010d3a5c2e5fa77c62b03ca1279": {
        "vhash": "1240451d05151\"z",
        "magic": "PE32+ executable for MS Windows (DLL) (console)",
        "tlsh": "T140B239D6CBBC0547E9663EB0124A8E9873D3E73EB4820416905A5F1981C837C5EF00F6E",
        "ssdeep": "192:8Ih6WxwWFUyfUlHDBQABJj1N80Hy5qnajWi8sA+F:Vh6WxwW0DBRJjPsl+yF",
        "mmdthash": "07022B59:7102870402200212DD032DC30EA0F1A9"
    },
    ......
}
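A short script can produce this merged file; the sketch below assumes mmdthash_test.json maps each sha1 to its mmdt hash string (that file's exact layout is not shown in this article), and that vhash_test.json already carries vhash/magic/tlsh/ssdeep per sha1 as in the example above.

```python
import json

def merge_hash_results(vhash_file, mmdt_file, out_file):
    """Attach the mmdthash value to each sha1 entry of the VT results."""
    with open(vhash_file, 'r') as f:
        merged = json.load(f)
    with open(mmdt_file, 'r') as f:
        mmdt_hashes = json.load(f)
    for sha1, entry in merged.items():
        # empty string when a sample has no mmdthash entry
        entry['mmdthash'] = mmdt_hashes.get(sha1, '')
    with open(out_file, 'w') as f:
        f.write(json.dumps(merged, indent=4))
    return merged
```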

2. Sensitive Hash Similarity Calculation

Using the classification results from python_mmdt: KNN Machine Learning Classification Results Testing and Analysis (Part 5) as a basis, we calculate the similarity values of ssdeep, tlsh, and vhash between associated files. There are three points to note in the calculation process:
① The calculation results of ssdeep are similarity values between [0,100], where 0 indicates completely unrelated and 100 indicates almost identical. For ease of comparison, in this test, the ssdeep similarity values are normalized to a range of [0,1] by the method: similarity = similarity value / 100.0
② The calculation results of tlsh are distance values between [0,X], where 0 indicates almost identical and X’s upper limit is currently unknown, but a larger distance indicates greater file differences. For ease of comparison, in this test, the tlsh values are normalized to a range of [0,1] by the method: similarity = 1.0 – distance value / 1160.0 (1160.0 is taken as the maximum value from 400 test data).
③ The similarity calculation method for vhash is not publicly disclosed, currently only two values are taken, 0 and 1, where 0 indicates two vhashes are not equal, and 1 indicates two vhashes are equal.
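These three rules can be written as standalone helpers (a sketch with illustrative names; the full comparison script that follows inlines the same logic):

```python
def normalize_ssdeep(score):
    # ssdeep comparison returns a match score in [0, 100]
    return score / 100.0

def normalize_tlsh(distance, max_distance=1160.0):
    # tlsh comparison returns a distance >= 0; 1160 is the empirical
    # maximum observed over the 400 test pairs in this article
    return max(0.0, 1.0 - distance / max_distance)

def normalize_vhash(h1, h2):
    # vhash has no published similarity measure; exact match only
    return 1.0 if h1 == h2 else 0.0
```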
The comparison calculation code is as follows:
# -*- coding: utf-8 -*-
import json
import ssdeep
import tlsh

def read_hash(file_name):
    with open(file_name, 'r') as f:
        datas = f.readlines()
        return [file_hash.strip() for file_hash in datas]

def ssdeep_compare(data1, data2):
    h1 = data1.get('ssdeep', '')
    h2 = data2.get('ssdeep', '')
    score = ssdeep.compare(h1, h2)
    return score/100.0

def tlsh_compare(data1, data2):
    h1 = data1.get('tlsh', '')
    h2 = data2.get('tlsh', '')
    score = tlsh.diff(h1, h2)
    return 1 - score/1160.0

def vhash_compare(data1, data2):
    # compare the vhash values (the original snippet mistakenly read 'tlsh')
    h1 = data1.get('vhash', '')
    h2 = data2.get('vhash', '')
    score = 1.0 if h1 == h2 else 0.0
    return score

def main():
    mmdt_hash_sim = read_hash('./mmdt_sim.csv')
    with open('./ssdeep_tlsh_vhash_mmdthash_test.json', 'r') as f:
        vhash_json = json.loads(f.read())
    print('Original File, Similar File, mmdt Similarity, ssdeep Similarity, tlsh Similarity, vhash Similarity, Original File Type, Similar File Type')
    for mhs in mmdt_hash_sim:
        tmp = mhs.split(',')
        ori_hash = tmp[0]
        sim_hash = tmp[1]
        mmdt_sim = float(tmp[2])
        ori_data = vhash_json[ori_hash]
        sim_data = vhash_json[sim_hash]
        ssdeep_sim = ssdeep_compare(ori_data, sim_data)
        tlsh_sim = tlsh_compare(ori_data, sim_data)
        vhash_sim = vhash_compare(ori_data, sim_data)
        ori_type = ori_data.get('magic', '').split(' ')[0]
        sim_type = sim_data.get('magic', '').split(' ')[0]
        print('%s,%s,%.3f,%.3f,%.3f,%.3f,%s,%s' % (
            ori_hash,sim_hash,mmdt_sim,ssdeep_sim,tlsh_sim,vhash_sim,ori_type,sim_type
        ))
if __name__ == '__main__':
    main()
The relevant files and download addresses:
① Integrated complete test data file ssdeep_tlsh_vhash_mmdthash_test.json: https://bbs.pediy.com/upload/attach/202201/467421_PP739ABEBEEEDUW._json
② mmdthash classification result file mmdt_sim.csv: https://bbs.pediy.com/upload/attach/202201/467421_JKYV8K9RE55G4V4._csv
③ Result file ssdeep_tlsh_vhash_mmdthash_test.xlsx: https://bbs.pediy.com/upload/attach/202201/467421_WTPBRUEPET5Q2MV.xlsx
Example data from ssdeep_tlsh_vhash_mmdthash_test.xlsx is as follows:
[Figure: sample rows from ssdeep_tlsh_vhash_mmdthash_test.xlsx]

3. Result Analysis

As mentioned earlier, we use the detection results of mmdthash as a baseline to compare the results of ssdeep, tlsh, and vhash.
Comparison of mmdthash Detection Results
With the mmdthash similarity threshold set at 0.95 and the results sorted from largest to smallest, the first 133 files are flagged as malicious. Of these, 132 are correctly detected as malicious; the remaining one is misclassified (its malware-family label is inconsistent).
[Figure: similarity distribution of the 133 detected files]
Comparison of mmdthash Undetected Results
With the mmdthash similarity threshold set at 0.95 and the results sorted from largest to smallest, the last 267 files are treated as undetected. Of these, 200 are correctly identified as clean, while 67 are malicious files wrongly predicted as clean.
[Figure: similarity distribution of the 267 undetected files]
Comparison
Combining the two similarity distribution charts, with 0.8 as ssdeep's judgment threshold:
Of the 133 detected samples, ssdeep detected 109 and missed 24.
Of the 267 undetected samples, ssdeep detected 5 and left 262 undetected.
Similarly, with 0.8 as tlsh's judgment threshold:
Of the 133 detected samples, tlsh detected 131 and missed 2.
Of the 267 undetected samples, tlsh detected 22 and left 245 undetected.
For vhash, which only takes the values 0 and 1:
Of the 133 detected samples, vhash detected 5 and missed 128.
Of the 267 undetected samples, vhash detected 0, leaving all 267 undetected.
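The detected/undetected counts map directly onto a confusion matrix, from which the ACC/REC/PRE metrics used in the conclusions are computed. A minimal sketch, using the mmdthash baseline numbers from this test (132 true positives, 1 false positive, 200 true negatives, 67 false negatives; 133 detected + 267 undetected = 400 files):

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, recall, and precision from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return accuracy, recall, precision

# mmdthash baseline at threshold 0.95
acc, rec, pre = metrics(tp=132, fp=1, tn=200, fn=67)
```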
Through manual analysis of the outliers corresponding to the samples, with the mmdthash threshold set at 0.95, ssdeep at 0.8, and tlsh at 0.8, the following statistical data is obtained:
[Table: comparison statistics under the chosen thresholds]
As shown in the chart:
[Figure: metric comparison chart]
Under the premise that the mmdthash threshold is set at 0.95, ssdeep at 0.8, and tlsh at 0.8, the following conclusions can be drawn:

tlsh has the highest accuracy and recall, but also the highest false positive rate (false positive rate = 1.0 - precision)

mmdthash has the second highest accuracy and recall, with accuracy 1.5% lower than tlsh, recall 9.0% lower than tlsh, and false positive rate 5.5% lower than tlsh

ssdeep has the third highest accuracy and recall, with accuracy 6.8% lower than tlsh, recall 21.4% lower than tlsh, and false positive rate 5.5% lower than tlsh

vhash is a special case; refer to the data directly for comparison.

In summary, based on the test results of this article, the overall ranking of sensitive hashing effectiveness is as follows:
tlsh > mmdthash > ssdeep > vhash

Others

1. Distribution of 400 Test File Types

PE files account for 96%

ELF files account for 2%

Other files account for 2%

[Figure: test file type distribution]

2. Calculation Time for 400 Test Files

To reflect the differences, the tests were specifically executed on a low-performance Raspberry Pi, with the time taken as follows:
[Figures: computation time comparison for the four algorithms]

Reflections and Gains

① The ssdeep paper is really well written: easy to follow, with a clear line of argument and very solid mathematical backing. Excellent.

② The tlsh ecosystem is very mature: six related papers and six conference presentations attest to its effectiveness in practice.

③ While implementing mmdthash I worked in too much isolation; many of my half-formed ideas had already been discussed clearly and thoroughly in the tlsh papers, and their applications are already very mature.

④ The performance of mmdthash needs to be continuously optimized.


KX ID: 大大薇薇

https://bbs.pediy.com/user-home-467421.htm

*This article is an original post by 大大薇薇 on the KX forum; please credit the KX community when reprinting.

