Binary Code Similarity Detection Based on LSTM


This article is an excellent piece from the KX Forum, author ID: Flying Fish Oil

1. Introduction

In recent years, the rapid development of natural language processing has produced a series of related algorithms and models: RNNs (recurrent neural networks) for processing sequential data, LSTM (long short-term memory) networks, GRUs (gated recurrent units), and models for computing word embeddings such as word2vec, ELMo, and pre-trained BERT models.
Several recent papers have studied applying these models and algorithms to binary code similarity analysis, enabling cross-platform binary code similarity detection. This article implements a simple model based on word2vec and LSTM to determine whether two functions, i.e. two instruction sequences, are similar.

2. Overall Framework

[Figure: overall framework of the model]

3. Function Embedding

LSTM is a variant of RNN designed to address the gradient-vanishing and long-term-dependency problems of plain RNNs. It introduces gate structures (an input gate, an output gate, and a forget gate) that alleviate the gradient-vanishing problem to some extent and enable the learning of long-term dependencies. The structure of an LSTM cell is as follows:
[Figure: LSTM cell structure]
The computation rules are as follows (in the notation of the official PyTorch documentation):

i_t = σ(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi)
f_t = σ(W_if x_t + b_if + W_hf h_{t-1} + b_hf)
g_t = tanh(W_ig x_t + b_ig + W_hg h_{t-1} + b_hg)
o_t = σ(W_io x_t + b_io + W_ho h_{t-1} + b_ho)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)

The W and b terms are the parameters learned by the LSTM; the specific parameter details can be found in the official PyTorch documentation.
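As a quick orientation to the PyTorch interface used later in this article, the following minimal sketch shows the tensor shapes involved; the sizes here are only illustrative.

import torch
import torch.nn as nn

# Illustrative sizes: input_size is the per-step feature dimension,
# hidden_size is the dimension of the hidden state h_t.
lstm = nn.LSTM(input_size=90, hidden_size=60, num_layers=2, batch_first=True)

x = torch.randn(8, 50, 90)     # [batch_size, seq_len, input_size]
out, (hn, cn) = lstm(x)
print(out.shape)               # [batch_size, seq_len, hidden_size] -> torch.Size([8, 50, 60])
print(hn.shape, cn.shape)      # [D*num_layers, batch_size, hidden_size] -> torch.Size([2, 8, 60]) each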

4. Instruction Embedding

The purpose of instruction embedding is to obtain a vectorized representation of each instruction so that it can be consumed by the LSTM and the rest of the model. Here we use the skip-gram model of word2vec. Word2vec is an open-source tool developed by Google for computing word embeddings and includes both the CBOW and skip-gram models. The instruction embedding is implemented as follows:
(1) Opcodes, registers, arithmetic symbols, and brackets are treated as individual words. For example, the instruction mov dword ptr [0x123456+eax*4], ebx is tokenized into mov, dword, ptr, [, 0x123456, +, eax, *, 4, ], ebx. Each instruction is treated as a sentence and fed into word2vec for training, which yields a vectorized representation for each token.
(2) To reduce the size of the vocabulary, constant values exceeding 0x5000 in the operands are replaced with mem, disp, or imm (see the sketch after this list):
[0xXXXXXXXX] -> [mem]
[0xXXXXXXXX + index*scale + base] -> [disp + index*scale + base]
0xXXXXXXXX -> imm
(3) The instruction vector consists of a vector corresponding to an opcode and two vectors corresponding to operands. If there are not enough operands, zero vectors are added for padding. For instructions with more than two operands, the vectors of the last two operands are summed and averaged. If an operand contains multiple tokens, the vectors of each token are summed and averaged to represent the current operand’s vector.
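A minimal sketch of steps (1) and (2) above. The helper name normalize and the regular expressions are my own illustrative assumptions (in the original code this preprocessing is presumably done inside RefineAsmCode), and the digit-length check is only a rough stand-in for the 0x5000 threshold:

import re

TOKEN_PATTERN = r'\w+|[\+\-\*:\[\]\,]'      # same token pattern as used in lstm.py below
HEX = r'0x[0-9a-fA-F]{4,}'                  # rough stand-in for "constant larger than 0x5000"

def normalize(insn: str) -> str:
    insn = re.sub(r'\[' + HEX + r'\]', '[mem]', insn)              # [0xXXXXXXXX]                -> [mem]
    insn = re.sub(r'\[' + HEX + r'(\s*[\+\-])', r'[disp\1', insn)  # [0xXXXXXXXX + idx*scale...] -> [disp + idx*scale...]
    return re.sub(HEX, 'imm', insn)                                # remaining 0xXXXXXXXX        -> imm

print(re.findall(TOKEN_PATTERN, normalize('mov dword ptr [0x123456+eax*4], ebx')))
# ['mov', 'dword', 'ptr', '[', 'disp', '+', 'eax', '*', '4', ']', ',', 'ebx']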

5. Code Implementation

The model is implemented with the PyTorch deep learning framework, and word2vec is provided by the gensim library. Word2Vec is called in insn2vec.py with the following parameters:
model = Word2Vec(tokensList, vector_size=wordDim, negative=15, window=5, min_count=1, workers=1, epochs=10, sg=1)
model.save('insn2vec.model')
tokensList is a list whose elements are the token sequences of individual instructions. After training, the word2vec model is saved to the insn2vec.model file so that it can be reloaded and fine-tuned later.
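After training, individual token vectors can be inspected through the wv attribute; the lstm.py code below relies on wv.vectors, wv.key_to_index, and wv.index_to_key in the same way. A small sanity check (assuming the token mov occurs in the training corpus) might look like this:

from gensim.models import Word2Vec

model = Word2Vec.load('insn2vec.model')
print(model.wv.vector_size)            # dimension of each token vector (wordDim)
print(model.wv['mov'].shape)           # (wordDim,), the vector of the token 'mov'
print(model.wv.key_to_index['mov'])    # row index of 'mov' in model.wv.vectors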
The implementation of instruction embedding is in the lstm.py file, as follows:
import re
import torch
import torch.nn as nn
from gensim.models import Word2Vec

class instruction2vec(nn.Module):
    def __init__(self, word2vec_model_path:str):
        super(instruction2vec, self).__init__()
        word2vec = Word2Vec.load(word2vec_model_path)
        self.embedding = nn.Embedding.from_pretrained(torch.from_numpy(word2vec.wv.vectors))
        self.token_size = word2vec.wv.vector_size               # dimension of a single token vector
        self.key_to_index = word2vec.wv.key_to_index.copy()     # dict: token -> row index
        self.index_to_key = word2vec.wv.index_to_key.copy()     # list: row index -> token
        del word2vec

    def keylist_to_tensor(self, keyList:list):
        indexList = [self.key_to_index[token] for token in keyList]
        return self.embedding(torch.LongTensor(indexList))

    def InsnStr2Tensor(self, insnStr:str) -> torch.Tensor:
        insnStr = RefineAsmCode(insnStr)    # RefineAsmCode is defined elsewhere in lstm.py
        tokenList = re.findall(r'\w+|[\+\-\*:\[\]\,]', insnStr)
        opcode_tensor = self.keylist_to_tensor(tokenList[0:1])[0]
        op_zero_tensor = torch.zeros(self.token_size)
        insn_tensor = None
        if 1 == len(tokenList):
            # No operands
            insn_tensor = torch.cat((opcode_tensor, op_zero_tensor, op_zero_tensor), dim=0)
        else:
            op_token_list = tokenList[1:]
            if op_token_list.count(',') == 0:
                # One operand
                op1_tensor = self.keylist_to_tensor(op_token_list)
                # tensor.mean averages over dim 0, collapsing the token vectors into one operand vector
                insn_tensor = torch.cat((opcode_tensor, op1_tensor.mean(dim=0), op_zero_tensor), dim=0)
            elif op_token_list.count(',') == 1:
                # Two operands
                dot_index = op_token_list.index(',')
                op1_tensor = self.keylist_to_tensor(op_token_list[0:dot_index])
                op2_tensor = self.keylist_to_tensor(op_token_list[dot_index+1:])
                insn_tensor = torch.cat((opcode_tensor, op1_tensor.mean(dim=0), op2_tensor.mean(dim=0)), dim=0)
            elif op_token_list.count(',') == 2:
                # Three operands: the vectors of the last two operands are averaged together
                dot1_index = op_token_list.index(',')
                dot2_index = op_token_list.index(',', dot1_index+1)
                op1_tensor = self.keylist_to_tensor(op_token_list[0:dot1_index])
                op2_tensor = self.keylist_to_tensor(op_token_list[dot1_index+1:dot2_index])
                op3_tensor = self.keylist_to_tensor(op_token_list[dot2_index+1:])
                op2_tensor = (op2_tensor.mean(dim=0) + op3_tensor.mean(dim=0)) / 2
                insn_tensor = torch.cat((opcode_tensor, op1_tensor.mean(dim=0), op2_tensor), dim=0)

        if insn_tensor is None:
            raise ValueError("error: insn_tensor is None")

        insn_size = insn_tensor.shape[0]
        if self.token_size * 3 != insn_size:
            raise ValueError("error: (token_size)%d != %d(insn_size)" % (self.token_size, insn_size))

        return insn_tensor    # [token_size * 3]

    def forward(self, insnStrList:list) -> torch.Tensor:
        insnTensorList = [self.InsnStr2Tensor(insnStr) for insnStr in insnStrList]
        return torch.stack(insnTensorList)    # [insn_count, token_size * 3]
The instruction2vec class performs the instruction embedding. token_size is the dimension of a single token vector, so each instruction vector has dimension token_size*3. The constructor loads the pre-trained token vectors from word2vec.wv.vectors into an nn.Embedding layer, which InsnStr2Tensor then uses to convert an instruction string into a vector.
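As a small usage sketch of my own (assuming insn2vec.model exists and every token of the example instructions is in its vocabulary):

inst2vec = instruction2vec("./insn2vec.model")
insns = ["push ebp", "mov ebp, esp", "pop ebp"]
vecs = inst2vec(insns)        # forward(): one [token_size*3] vector per instruction
print(vecs.shape)             # torch.Size([3, token_size * 3])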
The code implementation for function embedding is as follows:
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

class SiameseNet(nn.Module):
    def __init__(self, hidden_size=60, n_layers=2, bidirectional=False):
        super(SiameseNet, self).__init__()
        self.insn_embedding = instruction2vec("./insn2vec.model")
        input_size = self.insn_embedding.token_size * 3
        # input_size is the dimension of one instruction, hidden_size is the dimension of the whole instruction sequence
        self.lstm = nn.LSTM(input_size, hidden_size, n_layers, batch_first=True, bidirectional=bidirectional)

        self.D = int(bidirectional) + 1

        self.w_omega = nn.Parameter(torch.Tensor(hidden_size * self.D, hidden_size * self.D))
        self.b_omega = nn.Parameter(torch.Tensor(hidden_size * self.D))
        self.u_omega = nn.Parameter(torch.Tensor(hidden_size * self.D, 1))

        nn.init.uniform_(self.w_omega, -0.1, 0.1)
        nn.init.uniform_(self.u_omega, -0.1, 0.1)

    def attention_score(self, x):
        # x: [batch_size, seq_len, hidden_size*D]
        u = torch.tanh(torch.matmul(x, self.w_omega))
        # u: [batch_size, seq_len, hidden_size*D]
        att = torch.matmul(u, self.u_omega)
        # att: [batch_size, seq_len, 1]
        att_score = F.softmax(att, dim=1)    # attention weight for each step
        # att_score: [batch_size, seq_len, 1]
        scored_x = x * att_score             # element-wise weighting, broadcast over the hidden dimension
        return torch.sum(scored_x, dim=1)    # weighted sum over the sequence

    def forward_once(self, input:list) -> torch.Tensor:
        lengths = []    # record the length of each instruction sequence
        out = []
        for insnStrList in input:
            insnVecTensor = self.insn_embedding(insnStrList)    # convert the instructions to vectors
            out.append(insnVecTensor)
            lengths.append(len(insnStrList))

        pad_out = pad_sequence(out, batch_first=True)    # zero-pad so every sequence has the same seq_len
        pack_padded_out = pack_padded_sequence(pad_out, lengths, batch_first=True, enforce_sorted=False)
        packed_out, (hn, _) = self.lstm(pack_padded_out)    # input shape: [batch_size, seq_len, input_size]
        # hn: [D*num_layers, batch_size, hidden_size]
        # out: [batch_size, seq_len, hidden_size*D]; at this point out still contains the zero padding
        out, lengths = pad_packed_sequence(packed_out, batch_first=True)
        out = self.attention_score(out)
        return out

    def forward(self, input1, input2):
        out1 = self.forward_once(input1)    # out1: [batch_size, hidden_size]
        out2 = self.forward_once(input2)
        out = F.cosine_similarity(out1, out2, dim=1)
        return out
Since the input for function embedding is a pair of functions, this model is also a Siamese neural network with shared parameters. The hidden_size represents the dimension size of the function, set to 60 dimensions here. The attention_score corresponds to the attention mechanism, where w_omega is the W matrix and u_omega is the U matrix. The input type for PyTorch’s LSTM is a tensor of shape [batch_size, seq_len, input_size], which corresponds to a matrix of batch_size*seq_len*input_size, where batch_size is the number of functions and seq_len is the number of instructions.
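A minimal end-to-end usage sketch (again my own, with the same vocabulary caveat): each batch element is one function given as a list of instruction strings, and the network returns one cosine similarity per function pair.

net = SiameseNet(hidden_size=60, n_layers=2)
funcs_a = [["push ebp", "mov ebp, esp", "pop ebp", "ret"]]   # batch of one function
funcs_b = [["push ebp", "mov ebp, esp", "leave", "ret"]]     # batch of one function to compare against
sim = net(funcs_a, funcs_b)
print(sim)                    # tensor of shape [1], cosine similarity in [-1, 1]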
Although an LSTM can process sequences of arbitrary length, batching the computation in PyTorch requires every sequence in a batch to have the same seq_len, so shorter sequences are padded with zero vectors for alignment. This padding can affect the model: after computing torch.tanh(torch.matmul(x, self.w_omega)), the attention scores at the padded positions should be pushed toward negative infinity so that the softmax in the attention mechanism assigns them weights close to zero, i.e., attention does not focus on the padding. This handling is not implemented here; interested readers can add it themselves, and one possible variant is sketched below.
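Here is one such masking variant (my own sketch, not part of the original code). It reuses w_omega and u_omega and takes the lengths tensor returned by pad_packed_sequence, so forward_once could call masked_attention_score(out, lengths, self.w_omega, self.u_omega) instead of self.attention_score(out):

import torch
import torch.nn.functional as F

def masked_attention_score(x, lengths, w_omega, u_omega):
    # x: [batch_size, seq_len, hidden_size*D]; lengths: 1-D tensor of true sequence lengths
    u = torch.tanh(torch.matmul(x, w_omega))
    att = torch.matmul(u, u_omega)                                   # [batch_size, seq_len, 1]
    # True at padded positions, i.e. positions at or beyond the true sequence length
    pad_mask = torch.arange(x.size(1), device=x.device)[None, :] >= lengths.to(x.device)[:, None]
    att = att.masked_fill(pad_mask.unsqueeze(-1), float('-inf'))     # padded steps get ~0 weight after softmax
    att_score = F.softmax(att, dim=1)
    return torch.sum(x * att_score, dim=1)                           # [batch_size, hidden_size*D]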

6. Model Evaluation

The vector dimension of instructions is 30, and the vector dimension of functions is 60. The dataset uses six binary files:
ntdll_7600_x64.dll
ntoskrnl_7600_x64.exe
win32kfull_17134_x64.sys
ntdll_7600_x32.dll
ntoskrnl_7600_x32.exe
win32kfull_17134_x32.sys
Three files are x86 and three are x64. angr is used to extract 4437 functions in total; an x86 function and the x64 function with the same name form a similar (positive) pair, and a negative pair is built by matching an x86 function with a randomly selected x64 function of a different name (or vice versa). The positive-to-negative ratio in the dataset is 1:1, giving 4437*2 samples in total, which are split into training, validation, and test sets in an 8:1:1 ratio. Ideally, the positive and negative samples should also include some x86/x86 and x64/x64 pairs to make the data distribution more uniform. Training uses stochastic gradient descent with a learning rate of 0.09, a batch size of 8, and 50 training iterations.
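The article does not show the training loop or name the loss function; the sketch below is my own reading of the stated hyperparameters (SGD, learning rate 0.09, batch size 8, 50 iterations interpreted as epochs) and assumes an MSE loss between the cosine similarity and a +1/-1 label. train_batches is a hypothetical iterable yielding (funcs1, funcs2, labels):

import torch
import torch.nn as nn

net = SiameseNet(hidden_size=60, n_layers=2)
optimizer = torch.optim.SGD(net.parameters(), lr=0.09)
criterion = nn.MSELoss()     # assumed loss: not specified in the article

for epoch in range(50):
    for funcs1, funcs2, labels in train_batches:   # funcs1/funcs2: lists of 8 instruction sequences, labels: tensor of +1.0 / -1.0
        optimizer.zero_grad()
        sim = net(funcs1, funcs2)                  # [batch_size] cosine similarities
        loss = criterion(sim, labels)
        loss.backward()
        optimizer.step()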
[Figures: training-set loss, validation-set loss, and ROC curve]
The first two plots show the training-set and validation-set loss decreasing with the number of iterations. The validation loss has essentially converged and no longer drops noticeably, even though the training loss has not yet converged; continuing to train would most likely overfit, mainly because the amount of data is insufficient.
The third image is the ROC curve, with an AUC (Area Under Curve) of 0.91; the closer the AUC is to 1, the better. According to the ROC curve, when the threshold for cosine similarity between two functions is set to 0.24, the accuracy for both the validation and test sets can reach 84%.
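For reference, the AUC and the thresholded accuracy can be computed with scikit-learn as in the sketch below (not the author's evaluation code); sims stands for the predicted cosine similarities and y for the 0/1 ground-truth labels, both hypothetical NumPy arrays here:

from sklearn.metrics import roc_curve, auc

fpr, tpr, thresholds = roc_curve(y, sims)   # y: 1 = similar pair, 0 = dissimilar pair
print('AUC =', auc(fpr, tpr))

pred = (sims >= 0.24).astype(int)           # threshold chosen from the ROC curve
print('accuracy =', (pred == y).mean())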

7. Conclusion

The primary reason for the model’s accuracy not reaching over 90% is still the insufficient amount of data; the core of deep learning lies in having sufficient and effective data. Secondly, word2vec cannot handle the issue of polysemy; the instruction embedding based on word2vec computes the same vectorized representation for identical instructions, even if they have different contexts. For example, the following two instruction sequences:
mov eax, [0x12345678]        add eax, [0x12345678]
shl eax, 2                   shl eax, 2
ret                          inc eax
The two shl eax, 2 instructions have different contexts and should therefore have different vectorized representations. Finally, functions should be processed at the level of basic blocks, since the instruction sequences within a function are shaped by control flow and are not simply executed in order; a graph neural network could be introduced to extract a vectorized representation of the control-flow graph.

References

SAFE: Self-Attentive Function Embeddings for Binary Similarity
A simple function embedding approach for binary similarity detection
Understanding LSTM Networks


KX ID: Flying Fish Oil

https://bbs.pediy.com/user-home-742617.htm

* This article is original by the KX Forum, Flying Fish Oil, please indicate the source from KX Community when reprinting