[See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/221566041](https://www.researchgate.net/publication/221566041_Idea_Opcode-Sequence-Based_Malware_Detection?enrichId=rgreq-51fcf54d5ac23f8269b221ca5e2a2e18-XXX&enrichSource=Y292ZXJQYWdlOzIyMTU2NjA0MTtBUzoxMDY1MjUzNDMzNTQ4ODNAMTQwMjQwOTAyOTM5Mg%3D%3D&el=1_x_2&_esc=publicationCoverPdf) ## Idea: Opcode-Sequence-Based Malware Detection **Conference Paper · February 2010** DOI: 10.1007/978-3-642-11747-3_3 · Source: DBLP CITATIONS 170 **7 authors, including:** [Igor Santos](https://www.researchgate.net/profile/Igor-Santos-22?enrichId=rgreq-51fcf54d5ac23f8269b221ca5e2a2e18-XXX&enrichSource=Y292ZXJQYWdlOzIyMTU2NjA0MTtBUzoxMDY1MjUzNDMzNTQ4ODNAMTQwMjQwOTAyOTM5Mg%3D%3D&el=1_x_5&_esc=publicationCoverPdf) [University of Deusto](https://www.researchgate.net/institution/University-of-Deusto?enrichId=rgreq-51fcf54d5ac23f8269b221ca5e2a2e18-XXX&enrichSource=Y292ZXJQYWdlOzIyMTU2NjA0MTtBUzoxMDY1MjUzNDMzNTQ4ODNAMTQwMjQwOTAyOTM5Mg%3D%3D&el=1_x_6&_esc=publicationCoverPdf) **114** PUBLICATIONS **2,451** CITATIONS READS 2,568 [Felix Brezo](https://www.researchgate.net/profile/Felix-Brezo?enrichId=rgreq-51fcf54d5ac23f8269b221ca5e2a2e18-XXX&enrichSource=Y292ZXJQYWdlOzIyMTU2NjA0MTtBUzoxMDY1MjUzNDMzNTQ4ODNAMTQwMjQwOTAyOTM5Mg%3D%3D&el=1_x_5&_esc=publicationCoverPdf) [University of Deusto](https://www.researchgate.net/institution/University-of-Deusto?enrichId=rgreq-51fcf54d5ac23f8269b221ca5e2a2e18-XXX&enrichSource=Y292ZXJQYWdlOzIyMTU2NjA0MTtBUzoxMDY1MjUzNDMzNTQ4ODNAMTQwMjQwOTAyOTM5Mg%3D%3D&el=1_x_6&_esc=publicationCoverPdf) **17** PUBLICATIONS **799** CITATIONS [SEE PROFILE](https://www.researchgate.net/profile/Igor-Santos-22?enrichId=rgreq-51fcf54d5ac23f8269b221ca5e2a2e18-XXX&enrichSource=Y292ZXJQYWdlOzIyMTU2NjA0MTtBUzoxMDY1MjUzNDMzNTQ4ODNAMTQwMjQwOTAyOTM5Mg%3D%3D&el=1_x_7&_esc=publicationCoverPdf) [SEE PROFILE](https://www.researchgate.net/profile/Felix-Brezo?enrichId=rgreq-51fcf54d5ac23f8269b221ca5e2a2e18-XXX&enrichSource=Y292ZXJQYWdlOzIyMTU2NjA0MTtBUzoxMDY1MjUzNDMzNTQ4ODNAMTQwMjQwOTAyOTM5Mg%3D%3D&el=1_x_7&_esc=publicationCoverPdf) [Borja Sanz](https://www.researchgate.net/profile/Borja-Sanz-3?enrichId=rgreq-51fcf54d5ac23f8269b221ca5e2a2e18-XXX&enrichSource=Y292ZXJQYWdlOzIyMTU2NjA0MTtBUzoxMDY1MjUzNDMzNTQ4ODNAMTQwMjQwOTAyOTM5Mg%3D%3D&el=1_x_5&_esc=publicationCoverPdf) [University of Deusto](https://www.researchgate.net/institution/University-of-Deusto?enrichId=rgreq-51fcf54d5ac23f8269b221ca5e2a2e18-XXX&enrichSource=Y292ZXJQYWdlOzIyMTU2NjA0MTtBUzoxMDY1MjUzNDMzNTQ4ODNAMTQwMjQwOTAyOTM5Mg%3D%3D&el=1_x_6&_esc=publicationCoverPdf) **59** PUBLICATIONS **931** CITATIONS [Carlos Laorden](https://www.researchgate.net/profile/Carlos-Laorden?enrichId=rgreq-51fcf54d5ac23f8269b221ca5e2a2e18-XXX&enrichSource=Y292ZXJQYWdlOzIyMTU2NjA0MTtBUzoxMDY1MjUzNDMzNTQ4ODNAMTQwMjQwOTAyOTM5Mg%3D%3D&el=1_x_5&_esc=publicationCoverPdf) [University of Deusto](https://www.researchgate.net/institution/University-of-Deusto?enrichId=rgreq-51fcf54d5ac23f8269b221ca5e2a2e18-XXX&enrichSource=Y292ZXJQYWdlOzIyMTU2NjA0MTtBUzoxMDY1MjUzNDMzNTQ4ODNAMTQwMjQwOTAyOTM5Mg%3D%3D&el=1_x_6&_esc=publicationCoverPdf) **47** PUBLICATIONS **1,005** CITATIONS [SEE PROFILE](https://www.researchgate.net/profile/Borja-Sanz-3?enrichId=rgreq-51fcf54d5ac23f8269b221ca5e2a2e18-XXX&enrichSource=Y292ZXJQYWdlOzIyMTU2NjA0MTtBUzoxMDY1MjUzNDMzNTQ4ODNAMTQwMjQwOTAyOTM5Mg%3D%3D&el=1_x_7&_esc=publicationCoverPdf) [SEE PROFILE](https://www.researchgate.net/profile/Carlos-Laorden?enrichId=rgreq-51fcf54d5ac23f8269b221ca5e2a2e18-XXX&enrichSource=Y292ZXJQYWdlOzIyMTU2NjA0MTtBUzoxMDY1MjUzNDMzNTQ4ODNAMTQwMjQwOTAyOTM5Mg%3D%3D&el=1_x_7&_esc=publicationCoverPdf) **Some of the authors of this publication are also working on these related projects:** Supervised Machine Learning for the Detection of Troll Profiles in Twitter Social Network: Application to a Real Case of Cyberbullying [View project](https://www.researchgate.net/project/Supervised-Machine-Learning-for-the-Detection-of-Troll-Profiles-in-Twitter-Social-Network-Application-to-a-Real-Case-of-Cyberbullying?enrichId=rgreq-51fcf54d5ac23f8269b221ca5e2a2e18-XXX&enrichSource=Y292ZXJQYWdlOzIyMTU2NjA0MTtBUzoxMDY1MjUzNDMzNTQ4ODNAMTQwMjQwOTAyOTM5Mg%3D%3D&el=1_x_9&_esc=publicationCoverPdf) ----- # Idea: Opcode-sequence-based Malware Detection Igor Santos[1], Felix Brezo[1], Javier Nieves[1], Yoseba K. Penya[2], Borja Sanz[1], Carlos Laorden[1] and Pablo G. Bringas[1] 1S3Lab, 2eNergy Lab University of Deusto Bilbao, Spain ``` {isantos,felix.brezo,javier.nieves,yoseba.penya,borja. sanz,claorden,pablo.garcia.bringas}@deusto.es ``` **Abstract. Malware is every malicious code that has the potential to** harm any computer or network. The amount of malware is increasing faster every year and poses a serious security threat. Hence, malware detection has become a critical topic in computer security. Currently, signature-based detection is the most extended method within commercial antivirus. Although this method is still used on most popular commercial computer antivirus software, it can only achieve detection once the virus has already caused damage and it is registered. Therefore, it fails to detect new variations of known malware. In this paper, we propose a new method to detect variants of known malware families. This method is based on the frequency of appearance of opcode sequences. Furthermore, we describe a method to mine the relevance of each opcode and, thereby, weigh each opcode sequence frequency. We show that this method provides an effective way to detect variants of known malware families. **Key words: malware detection, computer security, machine learning** ### 1 Introduction Malware (or malicious software) is every computer software that has harmful intentions, such as viruses, Trojan horses, spyware or Internet worms. The amount, power and variety of malware increases every year as well as its ability to avoid all kind of security barriers [1] due to, among other reasons, the growth of Internet. Furthermore, malware writers use code obfuscation techniques to disguise an already known security threat from classic syntactic malware detectors. These facts have led to a situation in which malware writers develop new viruses and different ways for hiding their code, while researchers design new tools and strategies to detect them [2]. Generally, the classic method to detect malware relies on a signature database [3] (i.e. list of signatures). An example of a signature is a sequence of bytes that is always present in a concrete malware file and within the files already infected by that malware. In order to determine a file signature for a new malware executable and to finally find a proper solution for it, specialists have to wait until that ----- new malware instance has damaged several computers or networks. In this way, malware is detected by comparing its bytes with that list of signatures. When a match is found the tested file will be identified as the malware instance it matches with. This approach has proved to be effective when the threats are known in beforehand, and it is the most extended solution within antivirus software. Still, upon a new malware appearance and until the corresponding file signature is obtained, mutations (i.e. aforementioned obfuscated variants) of the original malware may be released in the meanwhile. Therefore, already mentioned classic signature-based malware detectors fail to detect those new variants [2]. Against this background we advance the state of art in two main ways. First, we address here a new method that is able to mine the relevance of an opcode (operational code) for detecting malicious behaviour. Specifically, we compute the frequency with which the opcode appears in a collection of malware and in a collection of benign software and, hereafter, we calculate a discrimination ratio based on statistics. In this way, we finally acquire a weight for each opcode. Second, we propose a new method to compute similarity between two executable files that relies on opcode sequence frequency. We weigh this opcode sequence frequency with the obtained opcode relevance to balance each sequence in the way how discriminant the composing opcodes are. ### 2 Mining opcode relevance Opcodes (or operational codes) can act as a predictor for detecting obfuscated or metamorphic malware [4]. Some of the opcodes (i.e. mov or push), however, have a high frequency of appearance within malware and benign executables, therefore the resultant similarity degree (if based on opcode frequency) between two files can be somehow distorted. Hence, we propose a way to avoid this phenomenon and to give each opcode the relevance that it really has. In this way, we have collected malware from the VxHeavens website [5] forming a malware dataset of 13189 malware executables. This dataset contains only PE executable files, and, more accurately, it is made up of different kind of malicious software (e.g. computer viruses, Trojan horses, spyware, etc ). For the benign software dataset, we have collected 13000 executables from our computers. This benign dataset includes, for instance, word-processors, drawing tools, windows games, internet browsers, pdf viewers and so on. We accomplish the following steps for computing the relevance of each opcode. First, we disassemble the executables. In this step, we have used The New_Basic Assembler [6] as the main tool for obtaining the assembly files. Second,_ using the generated assembly files, we have built an opcode profile file. Specifically, this file contains a list with the operational code and the un-normalized frequency within both datasets (i.e. benign software dataset and malicious software dataset). Finally, we compute the relevance of each opcode based on the frequency with which it appears in both datasets. To this extent, we use Mutual � _p(x,y)_ � _Information [7], I(X; Y ) =_ [�]yϵY �xϵX _[p][(][x, y][) log]_ _p(x)·p(y)_ . Mutual informa tion is a measure that indicates how statistically dependant two variables are. ----- In our particular case, we define the two variables as each opcode frequency and whether the instance is malware. In this way, X is the opcode frequency and _Y is the class of the file (i.e. malware or benign software), p(x, y) is the joint_ probability distribution function of X and Y, p(x) and p(y) are the marginal probability distribution functions of X and Y . Furthermore, once we computed the mutual information between each opcode and the executable class (malware of benign software) and we sorted them, we created an opcode relevance file. Thereby, this list of opcode relevance can help us to achieve a more accurate detection of malware variations since we are able to weigh the similarity function using these calculated opcode relevance and reducing the noise that irrelevant opcodes can produce. ### 3 Malware detection method In order to detect both malware variants we extract the opcode sequences and their frequency of appearance. More accurately, we define a program ρ as a sequence of instructions I where ρ = (I1, I2, ..., In−1, In). An instruction is composed by an operational code (opcode) and a parameter or list of parameters. In this way, we assume that a program is made up of opcodes. These opcodes can gather into several blocks that we call opcode sequences. More accurately, we assume a program ρ as a set of ordered opcodes o, _ρ = (o1, o2, o3, o4, ..., on−1, on), where n is the number of instructions I of the_ program ρ. A subgroup of opcodes is defined as an opcode sequence os where _os ⊆_ _ρ, and it is made up of opcodes o, os = (o1, o2, o3, ..., om−1, om), where m_ is the length of the sequence of opcodes os. First of all, we choose the length of opcode sequences. Afterwards, we compute the frequency of appearance of each opcode sequence. Specifically, we use _term frequency [8], tfi,j =_ �nki,j[n][k,j][, that is a weight widely used in information] retrieval. More accurately, ni,j is the number of times the term ti,j (in our case opcode sequence) appears in a document d, and [�]k _[n][k,j][ is the total number]_ of terms in the document d (in our case the total number of possible opcode sequences). Further, we compute this measure for every possible opcode sequence of a fixed length n, acquiring by doing so, a vector _[−→]v made up of frequencies_ of opcode sequences S = (o1, o2, o3, ..., on−1, on). We weigh the frequency of appearance of this opcode sequence using the weights described in section 2. To this extent, we define weighted term frequency (wtf ) as the result of weighting the relevance of each opcode when calculating the term frequency. Specifically, we compute it as the result of multiplying term frequency by the calculated weight of every opcode in the sequence. In this way, weight(o) is the calculated weight for the opcode o and tfi,j is the term frequency measure for the given opcode sequence, wtfi,j = tfi,j · _oϵS_ _weight100_ (o) . Once we have calculated the weighted [�] _term frequency, we have the vector of weighted opcode sequence frequencies._ _[−→]v =_ (wtf1, wtf2, wtf3, ..., wtfn−1, wtfn). ----- We have focused on detecting known malware variants in this method. In this way, what we want to provide is a similarity measure between two files. Once we extract the opcode sequences that will act as features, we have a proper representation of the files as two input vectors _[−→]v and_ _[−→]u of opcode sequences. Hereafter,_ we can calculate a similarity measure between those two vectors. In this way, we use cosine similarity, sim([−→]v, _[−→]u ) = cos (θ) =_ _−→v ·−→u_ _||[−→]v ||·||[−→]u ||_ [[9]. Therefore, we think] that this measure will give a high result when two versions of the same malware instance are compared. **Fig. 1. A histogram showing the obtained results in terms of frequency of similarity** ratio for the comparison of malware instances with their variants for n=1 **Fig. 2. A histogram showing the obtained results in terms of frequency of similarity** ratio for the comparison of malware instances with benign executables for n=1 ----- **Fig. 3. A histogram showing the obtained results in terms of frequency of similarity** ratio for the comparison of malware instances with their variants for n=2 **Fig. 4. A histogram showing the obtained results in terms of frequency of similarity** ratio for the comparison of malware instances with benign executables for n=2 ### 4 Experimental Results For the following experiment, we have used two different datasets for testing the system: a malware dataset and a benign software one. First, we downloaded a big malware collection from VxHeavens [5] website conformed by different malicious code such as trojan horses, virus or worms. Specifically, we have used the next malware families: Agobot, Bifrose, Kelvir, Netsky, Opanki and Protoride. We have extracted the opcode sequences of a fixed length (n) with n = 1 and _n = 2 for each malware and some of its variants. Moreover, we have followed the_ same procedure for the benign software dataset. Hereafter, we have computed the cosine similarity between each malware and its set of variants. Further, we have computed the similarity of the malware instance with the whole benign software dataset. ----- **Fig. 5. A histogram showing the obtained results in terms of frequency of similarity** ratio for the comparison of malware instances with their variants for n=1 and n=2 combined Specifically, we have performed this process for every malware executable file within the dataset. For each malware family, we have randomly chosen one of its variants as the known instance and we have computed the cosine similarity between this variant and the other variants of that specific malware family. Moreover, we have performed the same procedure with a set of benign software in order to test the appearance of false positives. Fig. 1 shows the obtained results of the comparison of malware families and their variants for an opcode sequence length n of 1. In this way, nearly every malware variant achieved a similarity degree between 90% and 100%. Still, the results obtained when comparing with the benign dataset (see Fig. 2) show that the similarity degree is too high, thus, this opcode sequence length seems to be not appropriate. For an opcode sequence length of 2, the obtained results in terms of malware variant detection(see Fig. 3) show that the similarity degrees are more distributed in frequencies, however, the majority of the variants achieved a relatively high results. In addition, Fig. 4 shows the obtained results for the benign dataset. In this way, the results are better for this opcode sequence length, being more frequent the low similarity ratios. Summarizing, on one hand, for the obtained results in terms of similarity degree for malware variant detection, the most frequent similarity degree is in the 90-100% interval. Moreover, the similarity degree frequency decreases and so the frequency does. Therefore, this method will be able to detect reliably a high number of malware variants after selecting the appropriate threshold of similarity ratio for declaring an executable as malware variant. Nevertheless, some of the executables were packed and, thereby, there are several malware variants that when computing the similarity degree did not achieve a high similarity degree. Still, the similarity degrees between the two kind of sets (i.e. malware variants and benign software) are not different enough. Therefore, we decided to perform ----- **Fig. 6. A histogram showing the obtained results in terms of frequency of similarity** ratio for the comparison of malware instances with benign executables for n=1 and n=2 combined another experiment where the different opcode sequence lengths are combined (n = 1 and n = 2). Figures 5 and 6 show the obtained results. In this way, the malware variant similarity degrees remained quite high whilst the benign similarity degrees scrolled to lower results. On the other hand, for the obtained results in terms of similarity degree when comparing the malware instances with the benign dataset, as one may think in beforehand, the behaviour of the system yields to be nearly the opposite than when comparing it with its variants. In this way, the system achieved low similarity degrees in the majority of the cases of the benign dataset. Hence, if we select a threshold that allows us to detect the most number of malware variants as possible whilst the number of false positives is kept to 0, our method renders as a very useful tool to detect malware variants. ### 5 Related Work There has been a great concern regarding malware detection in the last years. Generally, we can classify malware detection in static or dynamic approaches. Static detectors obtain features for further analysis without executing them since dynamic detectors execute malware in a contained environment. In this way, static analysis for malware detection can be focused on the binary executables [10] or in source code [11] like the method proposed in this paper. With regard to the binary analysis of the executables, there has been an hectic activity around the use of machine-learning techniques over byte-sequences. The first attempt of using non-overlapping sequence of bytes of a given length n as features to train a machine-learning classifier was proposed by Schulz et al. [12]. In that approach the authors proposed a method using the printable ASCII strings of the binary, tri-grams of bytes, the list of imported dynamically linked libraries (DLL), the list of DLL functions imported and the number of functions for each DLL. They applied multiple learning algorithms showing that multi ----- Na¨ıve Bayes perform the best. Kolter et al. [13] improved the results obtained by Schulz et al. using n-grams (overlapping byte sequences) instead of nonoverlapping sequences. Their method used several algorithms and the best results were achieved by a boosted decision tree. In a similar vein, a lot of work has been made over n-gram distributions of byte sequences and machine-learning [14]. Still, most of the features they used for the training of the classifiers can be changed easily by simply changing the compiler since they focus on byte distributions. Moreover, several approaches have been based in the so-called Control Flow _Graph Analysis. In this way, it is worth to mention the work of Christodescu_ and Jha [2] that proposed a method based of Control Flow Analysis to handle obfuscations in malicious software. Lately, Christodescu et. at. [15] improved the previous work including semantic-templates of malicious specifications. Nevertheless, the time resources they consume render them as not already full prepared to be adopted for antivirus vendors, although Control Flow Analysis techniques have proved to obtain some very valuable information of malicious behaviours. Dynamic analysis for malware detection, as aforementioned, runs a program in a contained environment and collects information about it. Despite they are limited by one execution flow, they can overcome the main issue of static analysis: be sure that the code that will be executed is the one that is being analysed [16]. Therefore, these methods do not have to face obfuscations or in-memory mutation [17]. In this way, the safe environment can be based on a virtual machine [18] or based on DLL Injection and API Hooking [19]. ### 6 Conclusions and future work Malware detection has risen to become a topic of research and concern due to its increasing growth in past years. The classic signature methods that antivirus vendors have been using are no longer effective since the increasing number of new malware renders them unuseful. Therefore, this technique has to be complemented with more complex methods that provide detection of malware variants, in an effort of detecting more malware instances with a single signature. In this paper, we proposed a method detecting malware variants that relied in opcodes sequences in order to construct a vector representation of the executables. In this way, based upon some length sequences, the system was able to detect the malicious behaviour of malware variants. Specifically, experiments have shown the following abilities of the system. First, the system was able to identify malware variants. Second, it was able to distinguish benign executables. The future development of this malware detection system is oriented in three main directions. First, we will focus on facing packed executables using a hybrid dynamic-static approach. Second, we will expand the used features using even longer sequences and more information like system calls. Finally, we will perform experiments with a larger malware dataset. ----- ### References 1. Karsperky-Labs: Kaspersky Security Bulletin: Statistics 2008 (2009) 2. Christodorescu, M., Jha, S.: Static analysis of executables to detect malicious patterns. In: Proceedings of the 12th USENIX Security Symposium. (February 2003) 169–186 3. Morley, P.: Processing virus collections. In: Proceedings of the 2001 Virus Bulletin Conference (VB2001), Virus Bulletin (2001) 129–134 4. Bilar, D.: Opcodes as predictor for malware. International Journal of Electronic Security and Digital Forensics 1(2) (2007) 156–168 5. VX heavens (2009) url: http://vx.netlux.org/ Last Accessed: Last accessed: September 29th, 2009. 6. NewBasic - An x86 Assembler/Disassembler for DOS http://www.frontiernet.net/ fys/newbasic.htm Last accessed: September 29th, 2009. 7. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence (2005) 1226–1238 8. McGill, M., Salton, G.: Introduction to modern information retrieval. McGraw-Hill (1983) 9. Tata, S., Patel, J.: Estimating the Selectivity of tf-idf based Cosine Similarity Predicates. SIGMOD Record 36(2) (2007) 75–80 10. Carrera, E., Erd´elyi, G.: Digital genome mapping–advanced binary malware analysis. In: Virus Bulletin Conference. (2004) 187–197 11. Ashcraft, K., Engler, D.: Using programmer-written compiler extensions to catch security holes. In: Proceedings of the 23[r]d IEEE Symposium on Security and Privacy. (2002) 143–159 12. Schultz, M., Eskin, E., Zadok, F., Stolfo, S.: Data mining methods for detection of new malicious executables. In: Proceedings of the 22[t]h IEEE Symposium on Security and Privacy. (2001) 38–49 13. Kolter, J.Z., Maloof, M.A.: Learning to detect malicious executables in the wild. In: Proceedings of the 10th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), New York, NY, USA, ACM (2004) 470–478 14. Santos, I., Penya, Y., Devesa, J., Bringas, P.: N-Grams-based file signatures for malware detection. In: Proceedings of the 11[th] International Conference on Enterprise Information Systems (ICEIS), Volume AIDSS. (2009) 317–320 15. Christodorescu, M., Jha, S., Seshia, S., Song, D., Bryant, R.: Semantics-aware malware detection. In: Proceedings of the 2005 IEEE Symposium on Security and Privacy. (2005) 32–46 16. Cavallaro, L., Saxena, P., Sekar, R.: On the limits of information flow techniques for malware analysis and containment. Lecture Notes in Computer Science 5137 (2008) 143–163 17. Bayer, U., Moser, A., Kruegel, C., Kirda, E.: Dynamic analysis of malicious code. Journal in Computer Virology 2(1) (2006) 67–77 18. King, S., Chen, P.: SubVirt: Implementing malware with virtual machines. In: 2006 IEEE Symposium on Security and Privacy. (2006) 314–327 19. Willems, C., Holz, T., Freiling, F.: Toward automated dynamic malware analysis using cwsandbox. IEEE Security & Privacy 5(2) (2007) 32–39 -----