# DIGITDIGITDIGITAL GENOME MAPPING –DIGITDIGITAL GENOME MAPPING –AL GENOME MAPPING –AL GENOME MAPPING –AL GENOME MAPPING – ADVADVADVANCED BINARADVADVANCED BINARANCED BINARANCED BINARY MALANCED BINARY MALY MALY MALWY MALWWWAREWAREAREAREARE ANALANALANALYSISANALANALYSISYSISYSISYSIS Ero Carrera and Gergely Erdélyi F-Secure Corporation Anti-Virus Research Team, Tammasaarenkatu 7, 00181, Helsinki, Finland Tel +358 9 2520 0700 • Fax +358 9 2520 5001• Email ero.carrera@f-secure.com • gergely.erdelyi@f-secure.com ## ABSTRACTABSTRACTABSTRACTABSTRACTABSTRACT Windows binary malware has come a long way. Today’s average worm is often tens or hundreds of kilobytes of code exhibiting a level of complexity that surpasses even some operating systems. This degree of complexity, coupled with the overwhelming flow of new malware, calls for improvements to tools and techniques used in analysis. Our paper elaborates on how to use graph theory to aid the analysis. Using graphs and extensions with the popular Interactive Disassembler Pro package, we hope to reduce the time needed to understand the structure of complex malware. These methods have proven to be helpful in finding similarities and differences between different malware variants and strains. Focusing on the differences by keeping off already known code allows rapid analysis and classification of malware, while reducing redundant efforts. ## 1. INTRODUCTION1. INTRODUCTION1. INTRODUCTION1. INTRODUCTION1. INTRODUCTION Nowadays, reverse engineers – especially those within the anti-virus and forensics industries – face the challenge of identifying and analysing a large number of binaries appearing at an incredible rate. Some of the difficulties analysts face are: - The sheer number of samples to check. - The amount of code to analyse on each of them. - Recognizing variants and versions of families of malware and other software. Most of those problems can be solved with proper automation and algorithms. Ideas and results from prototypes will be presented in this paper. We are introducing a new platform based on Interactive Disassembler Pro and the Python programming language. The aim is to aid development of research prototypes and advanced analysis. The resulting tools will be targeted to recognize similar features in the structure of executables. The theoretical base we rely on to achieve our objective will be graph theory, by looking at representations of code in the form of graphs. With them, the necessary level of abstraction is achieved and general features in the structure of a binary become distinguishable. ## 2. PROGRAMMING IDA PRO2. PROGRAMMING IDA PRO2. PROGRAMMING IDA PRO2. PROGRAMMING IDA PRO2. PROGRAMMING IDA PRO People who perform analysis of binary code tend to collect and use a large arsenal of different tools. Apart from a good hex editor, probably the most commonly used tool is Interactive Disassembler Pro (IDA) [1]. IDA has earned its reputation by being extremely versatile, flexible and extensible. The already remarkable set of features IDA has can be further expanded with the built-in script language and compiled plugins. In the next section we will take a brief look at both extension methods. ## 2.1 I2.1 I2.1 IDC2.1 I2.1 IDCDCDCDC The easy way to get started with IDA programming is IDC, the scripting language of IDA. The syntax of IDC is very similar to ANSI C, which makes it familiar for experienced C programmers. Unlike C, IDC is an interpreted language which saves us from the tedious write-compile-fix development cycle of C. IDC comes with an extensive set of built-in functions that cover many aspects of the disassembly process as well as the structure and data of the program being disassembled. The semantics of all those functions are, not too surprisingly, close to regular C functions. The following simple program is an example of IDC code. It is a trivial decryptor that uses XOR 0xFF to decrypt a zero-terminated ASCII string, starting from the current cursor position. #include static main () { auto baseaddr; auto key; auto c; baseaddr = ScreenEA(); key = 0xFF; while (1) { c = Byte(baseaddr); if (c == 0) { break; } PatchByte(baseaddr, c ^ key); baseaddr++; } } ## 2.2 Plugins2.2 Plugins2.2 Plugins2.2 Plugins2.2 Plugins In case the required code exceeds the capabilities of IDC we can resort to writing an IDA plugin. The SDK comes with IDA and defines an extensive API for plugins. Unlike IDC the API gives access even to the most intimate details of the IDA database and all other aspects of the disassembly. IDA plugins can be incredibly powerful with this level of control. Plugins are developed in C++ and can be compiled with Borland, Microsoft, Watcom or GNU C++ compilers. ----- Since the code is compiled to native binary, the execution speed is similar to any other native application. The main disadvantage also comes from the need of compilation which makes the development tedious. Another problem is that a small bug can crash the whole IDA application, possibly corrupting the database. The following example shows the plugin implementation of the XOR decryptor with all the plugin administration code stripped. void decode (void) { ea_t baseaddr; uchar key; uchar c; baseaddr = get_screen_ea(); key = 0xFF; while (1) { c = get_byte(baseaddr); if (c == 0) { break; } patch_byte(baseaddr, c ^ key); baseaddr++; } } The look of the code did not change much. The IDA API functions have somewhat longer names and more specific purpose than the IDC counterparts. To implement certain IDC functions, often several IDA API calls and data conversions have to be made. ## 3. NEED FOR CHANGE?3. NEED FOR CHANGE?3. NEED FOR CHANGE?3. NEED FOR CHANGE?3. NEED FOR CHANGE? What is wrong with IDC and plugin writing? Generally speaking nothing. Either of them would be suitable to implement anything that can be implemented with the help of IDA. On the other hand there are other points to consider here. The undisputed dominance and popularity of C and C++ does not automatically mean that they are the optimum choice for every task. The lack of complex, abstract data structures make C unsuitable for rapid development and prototyping. Without trying to start a language debate we can also note that the extensions of C++ over C do not have the easiest to remember syntax either. C and C++ put efficiency over expressiveness, which certainly made sense when computing power was more expensive than programmers’ time. However with computing horsepower getting cheaper and cheaper it does not sound so crucial any more to spare clock cycles at the expense of human productivity. What would be a better tool then? Before jumping to conclusions it is necessary to examine the most common practical applications of IDA extensions. As far as we can tell the two most common are: - experimental reverse engineering tools - quick tools for malware analysis. Experimental tools evolve quickly, which clearly benefit from a development environment that supports frequent code and interface changes. The programming language and the tools must not limit the researcher’s ability to significantly change the code whenever it is needed. Code written for malware analysis purposes usually solves a single problem and, typically, it is not reused. Decryptors written for different malware’s string scramblers are a good example of this group. According to our experience these tools cannot be fully reused too often as malware changes very rapidly, even in the same family. In both cases the time required for development is more important than execution speed. The above observations lead us to the conclusion that we need to find a language which is very expressive, allows rapid prototyping and is embeddable into applications. Modern, very high-level scripting languages can easily fulfil these requirements. After the evaluation of several modern scripting languages our choice was Python. ## 3.1 What is Python?3.1 What is Python?3.1 What is Python?3.1 What is Python?3.1 What is Python? Quoting from the Python website: “Python is an interpreted, interactive, object-oriented _programming language. It is often compared to Tcl, Perl,_ _Scheme or Java._ _“Python combines remarkable power with very clear_ _syntax. It has modules, classes, exceptions, very high level_ _dynamic data types, and dynamic typing. There are_ _interfaces to many system calls and libraries, as well as to_ _various windowing systems (X11, Motif, Tk, Mac, MFC)._ _New built-in modules are easily written in C or C++._ _Python is also usable as an extension language for_ _applications that need a programmable interface.” [2]_ Even though Python’s history started over a decade ago it has gained mainstream recognition only quite recently. The language is continuously evolving yet Python developers have been able to maintain excellent backward compatibility with older source code. One slogan of the Python movement is: “Batteries included”. For most tasks all the required modules are included in the standard distribution. These modules range in functionality from simple data handling to server application components. A large number of third-party modules are also available, from simple utilities to scientific computing and number crunching. ## 3.2 Why Python?3.2 Why Python?3.2 Why Python?3.2 Why Python?3.2 Why Python? Python has many properties that make it well suited for our needs. It is easy to learn – anyone with prior programming experience can pick it up really quickly. For newcomers an excellent tutorial is available on the Python website [3]. The language is object-oriented and supports different abstract code and data structures which make it easy to express complex algorithms. Python has built-in support for lists, dictionaries (hashes) and tuples. Using these, algorithms that would take several pages in C can be expressed in just a few lines. This provides a perfect playground for trying out ideas. If one of the ideas does not work out the code can be thrown away without the slightest feeling of regret. The same thing is much more ----- painful after the effort of chasing down all the pointers and off-by-one errors. Of course, just like anything else, Python has some disadvantages too. Most notable is execution speed. The programs are interpreted and sometimes run 2–20 times slower than the C/C++ implementations. Where the execution speed is a primary factor Python might not be the optimal choice. On the other hand Python is extremely easy to extend with C/C++ code. In most cases careful separation of computationally intensive parts into C/C++ extension modules solves the performance problem. Another way of speeding up Python code is the use of Psyco [4], which is a Just-In-Time (JIT) compiler that can achieve significant speedup depending on the type of the code. ## 4. MEET IDAPYTHON4. MEET IDAPYTHON4. MEET IDAPYTHON4. MEET IDAPYTHON4. MEET IDAPYTHON Our attempt to make the raw power of IDA more accessible with a user-friendly, very high-level language resulted in the birth of IDAPython. Implementation-wise IDAPython is an IDA plugin which wraps the IDA API and connects it to the Python environment as an extension module. Scripts run inside IDA have access to both the IDA API and all installed Python modules. IDAPython consists of several modules, organized in three layers: 1. Low-level IDA API wrapper layer 2. IDA API user layer with class definitions 3. IDC-compatibility and utility modules The following figure illustrates the structure of IDAPython: **_Figure 1. IDAPython organization._** ## 4.1 Low-level wrapper layer4.1 Low-level wrapper layer4.1 Low-level wrapper layer4.1 Low-level wrapper layer4.1 Low-level wrapper layer The low-level API layer is a direct Python wrapper around the IDA API. This module is statically linked into the IDAPython plugin and can be imported with the name ‘_idaapi’. The wrapper code is generated by Simplified Wrapper Interface Generator (SWIG) [5]. _idaapi directly reflects the C++ IDA API, functions have the same name, same arguments and return values. The IDA SDK documentation, with few exceptions, is directly applicable to the Python wrappers as well – although changes had to be made where the C++ semantics are significantly different from the Python. These differences from the official IDA API are documented in the plugin package. The differences mostly concern functions that return data in their arguments or use complex or unusual arguments. For example: IDA C++ API version: char *get_func_name( ea_t ea, char *buf, size_t bufsize); The C++ version expects the function address and a pointer to the output buffer with bufsize size. It either returns the address of the output buffer or NULL if an error occurred. The idea of output buffer pointers is not used in Python so we have to change the definition that makes sense there as well. Python _idaapi module version: _idaapi.get_name(from, ea) In the Python version the wrapper allocates memory for the output string and returns either a Python string or None if the function fails. Similar changes have to be made to other functions too. This module should not be used directly, all the functions and data are available in the upper level module described in the next section. ## 4.2 User layer4.2 User layer4.2 User layer4.2 User layer4.2 User layer The user layer module is called ‘idaapi’ which is a thin layer on the top of the ‘_idaapi’ module. It contains all the functions and data from the module below and adds Python class definitions for the wrapped C++ classes, structures and unions. Using these classes complex data structures can be accessed with the usual class.member notation. This is the module that most user programs that need IDA API access should use. ## 4.3 IDC compatibility module4.3 IDC compatibility module4.3 IDC compatibility module4.3 IDC compatibility module4.3 IDC compatibility module Since most experienced IDA users are familiar with the IDC language a compatibility module has been developed. All the IDC definitions are automatically imported from the ‘idc’ module to provide a familiar environment for people with IDC experience. It also makes porting of IDC script to Python easier. The IDC module is developed with maximum compatibility in mind, however there are certain functions that have native implementation in the Python libraries and thus were removed. In the current state over half of the IDC functions have been implemented, the final goal is to implement all functions that are applicable in the Python environment. ## 4.4 Idautils module4.4 Idautils module4.4 Idautils module4.4 Idautils module4.4 Idautils module To lift the burden of using low-level functions for everything a new module was created. Idautils is a module that contains assorted functions that wrap the most often used low-level code. Examples include functions that ----- provide a list representation of result that could previously be collected with *First() and *Next() type of calls and complex loop conditions. The development of ‘idautils’ is in progress and it will be extended with any functionality that qualifies to be included. All the functions are thoroughly documented with Python docstrings, so it is easy to experiment with them. A simple example follows in the next section to highlight the benefits of this module. ## 5. EXAMPLE5. EXAMPLE5. EXAMPLE5. EXAMPLE5. EXAMPLE To demonstrate the advantages of the new approach we are going to implement a relatively short program. The purpose of the program is to enumerate all defined functions in a given segment and list the places they are called from. Typical output of the tool is something like this: Function _start at 0x8048520 Function call_gmon_start at 0x8048544 called from _start(0x8048543) called from .init_proc(0x804848e) Function __do_global_dtors_aux at 0x8048570 called from .term_proc(0x8048741) Function frame_dummy at 0x80485b0 called from .init_proc(0x8048493) Function main at 0x80485e4 Function __libc_csu_init at 0x8048658 Function __libc_csu_fini at 0x80486a0 Function __i686.get_pc_thunk.bx at 0x80486f2 called from __libc_csu_init(0x8048660) called from __libc_csu_fini(0x80486ad) Function __do_global_ctors_aux at 0x8048700 called from .init_proc(0x8048498) This example is deliberately kept simple to focus on the implementation and not the algorithmic behaviour. ## 5.1 IDC version5.1 IDC version5.1 IDC version5.1 IDC version5.1 IDC version First we implemented the small program in IDC: #include static main() { auto ea, func, ref; // Get current ea ea = ScreenEA(); // Loop from start to end in the current segment for (func=SegStart(ea); func ! = BADADDR && func < SegEnd(ea); func=NextFunction(func)) { // If the current address is function process it if (GetFunctionFlags(func) ! = -1) { Message(‘Function %s at 0x%x\n‘, GetFunctionName(func), func); // Find all code references to func for (ref=RfirstB(func); ref ! = BADADDR; ref=RnextB(func, ref)) { Message(‘ called from %s(0x%x)\n‘, GetFunctionName(ref), ref); } } } } The source code does exactly what we want it to do but it’s not too easy on the eyes. It might need more than one glance to understand how it works, even after such short time as going for a lunch break. One of the problems is the C way of iterating through lists, which involves several functions and loops with complex conditions. ## 5.2 Python version5.2 Python version5.2 Python version5.2 Python version5.2 Python version The second version was implemented in Python using only the idaapi module: from idaapi import * # Get current ea ea = get_screen_ea() # Get segment class seg = getseg(ea) # Loop from segment start to end func = get_func(seg.startEA) while func ! = None and func.startEA < seg.endEA: funcea = func.startEA print ‘Function %s at 0x%x‘ % \ (GetFunctionName(funcea), funcea) ref = get_first_cref_to(funcea) while ref ! = BADADDR: print ‘ called from %s(0x%x)‘ % \ (get_func_name(ref), ref) ref = get_next_cref_to(funcea, ref) func = get_next_func(funcea) The syntax has been converted to Python which uses indentation instead of braces for marking code blocks. This program serves as a good example that a good choice of the programming language itself does not warrant easy to read source code. This version of the routine is still not as clear and easy to understand as we would like it to be. ## 5.3 Python version with ‘idautils’5.3 Python version with ‘idautils’5.3 Python version with ‘idautils’5.3 Python version with ‘idautils’5.3 Python version with ‘idautils’ The last version of the script has been rewritten from scratch to utilize both the IDC and idautils module. from idautils import * # Get current ea ea = ScreenEA() # Loop from start to end in the current segment for funcea in Functions(SegStart(ea), SegEnd(ea)): print ‘Function %s at 0x%x‘ % \ (GetFunctionName(funcea), funcea) # Find all code references to funcea for ref in CodeRefsTo(funcea, 1): print ‘ called from %s(0x%x)‘ % \ (GetFunctionName(ref), ref) This version of the tool consists of six lines of effective code which took two to three times more with the other implementations. It is also much easier to understand than the others. Although we are not suggesting that all source will shrink to half its size, we are convinced that with the help of well-written, high-level wrappers over the most common functions code size and development time can be reduced. The main advantage of IDAPython is that any of the layers can be used even in the same code with certain precaution. One does not have to sacrifice the direct database access for IDC or the idautils package. ----- ## 6. IDAPYTHON THE PLUGIN6. IDAPYTHON THE PLUGIN6. IDAPYTHON THE PLUGIN6. IDAPYTHON THE PLUGIN6. IDAPYTHON THE PLUGIN 6.1 Usage6.1 Usage6.1 Usage6.1 Usage6.1 Usage Usage of the plugin is similar to IDC. Python code can be executed by either loading it from a file or entering it into a Python expression window. The latter is only suitable for experiments with short Python statements. Possibility to save the session could be added in the future as time permits. Future enhancements to the user interface might include features like a fully interactive Python session inside IDA. Unfortunately it is not possible to implement that with the current IDA API. The following screenshot shows the Python expression window: **_Figure 2. Example session._** ## 6.2 Status and availability6.2 Status and availability6.2 Status and availability6.2 Status and availability6.2 Status and availability IDAPython is under heavy development, yet in an advanced state already. The plugin is used in-house by the _F-Secure Antivirus Research Team on a regular basis._ More improvements and testing are in the works. By the time this paper is published the first feature complete version of the IDAPython plugin will be released at http://www.d-dome.net/idapython/. The package will include the full source code and can be used for any purpose, free of charge. The reader is invited to try it out and provide much valued feedback. Improvement ideas, code and other contributions are most welcome. ## 7. INTRODUCING GRAPHS7. INTRODUCING GRAPHS7. INTRODUCING GRAPHS7. INTRODUCING GRAPHS7. INTRODUCING GRAPHS 7.1 Pr7.1 Pr7.1 Previous work7.1 Pr7.1 Previous workevious workevious workevious work Far from being a seminal work, this paper attempts to raise the importance of clear and visual methods of analysis. Halvar Flake is a distinguished figure in the binary reverse engineering world, he has been using graphs and similar comparison metrics as used here in specific steps of our algorithm, aiming at tools which find differences between different versions of a given binary. Flake’s work [6] elaborates on the use of graphs for malware analysis and exposes some of the advantages that we will reintroduce in this paper. ## 7.2 The pr7.2 The pr7.2 The proposed appr7.2 The pr7.2 The proposed approposed approposed approachoposed approachoachoachoach Graphs have always provided a clear way of looking at problems. Complex hierarchies instantly become easier to understand once they are represented in a way humans find easy to assimilate. Graphs are one such way of displaying data, instead of more cumbersome representations like endless lists of numbers and references to code and data as provided in a traditional disassembly. Additionally, a rich literature already exists on graph theory with algorithms to treat and analyse them. Once we are consistently representing code as graphs, a door is open to know mathematical methods for analysis and further topics such as clustering. In this paper we will be representing code as graphs on global and function levels. Graphs provide abstraction, and that is exactly what enhances the ability to build simple algorithms to work on top of them. ## 7.3 Pr7.3 Pr7.3 Preparing the data for analysis7.3 Pr7.3 Preparing the data for analysiseparing the data for analysiseparing the data for analysiseparing the data for analysis 7.3.1 Exporting the data7.3.1 Exporting the data7.3.1 Exporting the data7.3.1 Exporting the data7.3.1 Exporting the data A tool was created to export IDA databases to a metalanguage representation, after a given binary has been loaded into IDA. This tool does not export the whole set of information supported in IDA’s database format, but a sufficient subset including: - Every defined function. - Every instruction. The full binary representation, mnemonic and operands as presented by IDA to the user. - All the defined referenced strings and names. - All code references. - Code flags: whether the code is flow (useful to coherently extract the basic blocks of a function). - Miscellaneous flags: whether a function belongs to a known library or is a thunk. All this information has been found to be enough to provide a fertile playground for our research. Future additions will include all comments in order to add as much high-level readable information as possible. The name REML was coined for the resulting metalanguage file, standing for Reverse Engineering _Meta-Language, or RevEngML._ ## 7.3.2 Interfacing the data7.3.2 Interfacing the data7.3.2 Interfacing the data7.3.2 Interfacing the data7.3.2 Interfacing the data The next step was generating an object-oriented framework, focusing on usability and ease of use, while giving lower priority to performance issues. A Python module was created to work on the REML files. The module implements a top Binary class, which ----- provides methods to access the generic binary information such as the entry point to the code and the list of its functions. It is also possible to query functions by their names or addresses in their body. Subsequently, an extensive Function class was implemented. Among others, it provides the following: - Access to all the instructions. - Ingoing and outgoing code and data references. - Generation of the list of basic blocks from which to generate a CFG (Control Flow Graph). A third, smaller class, was added under the rather unoriginal name Instruction in order to access the instructions and their operands. ## 7.3.3 Pr7.3.3 Pr7.3.3 Processing the data7.3.3 Pr7.3.3 Processing the dataocessing the dataocessing the dataocessing the data Our research and prototypes lie on top of the interface we created in order to access the data. It gave us, among other possibilities, that of quickly generating graphs from a binary. In a fairly small number of lines (as compared with IDA’s IDC language) we could iterate over the whole set of functions, gather their references and create a graph object [7]. We took the approach that every single function in a binary is represented by a vertex in the graph and calls between them translate to the graph’s edges. The plotting algorithms and style were fine tuned, over some initial testing period, until we achieved understandable representation of rather complex and extensive binaries. Two main modes of visualization were found to provide good results. A hierarchical representation provides good results for binaries up to certain number of functions; this layout is provided by the GraphViz’s dot. A second, radial layout (Figure 3), was obtained through _GraphViz’s twopi, which proved to be useful to visualize_ larger cases. After the initial visual advantages of graphs, exploiting their mathematical properties became the target of our efforts. This is the topic of the next section. ## 8. ALGORITHMS BASED ON GRAPH8. ALGORITHMS BASED ON GRAPH8. ALGORITHMS BASED ON GRAPH8. ALGORITHMS BASED ON GRAPH8. ALGORITHMS BASED ON GRAPH REPRESENTREPRESENTREPRESENTAAAAATION OF BINARIESREPRESENTREPRESENT TION OF BINARIESTION OF BINARIESTION OF BINARIESTION OF BINARIES 8.1 Binary comparison, graph similarities8.1 Binary comparison, graph similarities8.1 Binary comparison, graph similarities8.1 Binary comparison, graph similarities8.1 Binary comparison, graph similarities In this section we will address the process of finding how close two binaries are to each other. The result of our algorithms will be a value indicating how much of their call flow graphs overlap or relate to each other. Once we have such values, we are in the position of attempting to perform classification of malware by their similarities. Borrowing from the field of Phylogenetics [8], we will apply clustering analysis algorithms in order to obtain a taxonomy of the malware at hand. The results from this approach are encouraging, indicating an area of research which should be explored further. ## 8.1.1 The comparison algorithm8.1.1 The comparison algorithm8.1.1 The comparison algorithm8.1.1 The comparison algorithm8.1.1 The comparison algorithm Given the graph representing the binary under analysis, we will create its adjacency matrix. This matrix will have a row and column for each of the functions present in the binary, and its element in position (I, J) will indicate whether a function in row I performs a call to the function in column J. The objective we pursue is to be able to relate functions from two executables based on their position in the graph resulting from the connections among all the functions. A function which calls and is called by a given set of functions, is matched against another with a similar set of called and caller functions. Finding graph isomorphisms is known to be a fairly complex task, no polynomial time algorithm is known to solve this problem. Fortunately, the problem at hand is much simpler as we have more information to put into our model. To define a basic set of nodes from which to start, we choose operating system and library calls. In other words, any function which does not perform a further call to code belonging to the binary and which will have a common **_Figure 3. Mimail family graphs. Certain patterns can be_** name across different executables. With this approach we **_appreciated between members of this family._** automatically rule out binaries which do not call any such ----- functions, although it would be still applicable if a researcher would manually find a basic set of common functions between two binaries and name them accordingly. Once we have a set of commonly named functions on two different binaries, we compose an adjacency matrix and suppose as identical the functions with identical unique rows in this matrix. We incorporate the information gained in each pass and rerun the matching, effectively progressing through the graph, until a maximum of functions is matched. Additional optimizations have been put in place when ambiguities arise, as function matching according to metrics calculated from their internal structure. After the process in finished, we have a set of functions common to both binaries. ## 8.1.2 Definition of the algorithm8.1.2 Definition of the algorithm8.1.2 Definition of the algorithm8.1.2 Definition of the algorithm8.1.2 Definition of the algorithm - We have two executables’ graphs which we want to compare. A source S and a target one T. We define the following: **– Sf = {set of all functions composing S}** **_– Tf = {set of all functions composing T}_** - Find the common set of atomic functions [9]. That is: - We define Sr = Sf – C and Tr = Tf – C as the sets of remaining functions to be matched. - Now, for we create a call-tree signature [10] for _f from C and consider it identical to_ if the signature of generated from C is identical and unique. – If such match is found we augment the C matrix with the signature of f. which will allow us to match higher level functions not necessarily relying on API functions. - Once this algorithm does not yield any more matches, we will have a matrix C containing all the functions occupying the same positions in the graphs of both S and T. At the end of the process we will have unmatched functions of two types. 1. The ones which are different and can not be matched. 2. Functions which do not have unique signatures. In a further development of the algorithm we enhanced its matching capabilities by using a new kind of signature, namely the list of edges connecting basic blocks from the functions CFG (CFG signatures). Functions with non-unique call-tree signatures can be matched if unique _CFG signatures exist._ We implemented this enhancement to be run every time we exhaust the call-tree signatures. Finding some correspondences between functions will often lead to additional matches in further passes of the algorithm. This has successfully increased the accuracy of our method. ## 8.1.3 The index of similarity8.1.3 The index of similarity8.1.3 The index of similarity8.1.3 The index of similarity8.1.3 The index of similarity Once we have a set of supposedly equivalent functions between both binaries, the index of similarity is obtained as follows. Let’s call the set of equivalent functions in the source binary S, Se = {equivalent functions in S} and similarly define Te = {equivalent functions in T}, note that both sets contain the API calls found to exist in both binaries. Let denote the similarity index function and be defined as: which leads to the non-symmetric similarity indexes (or distances): we define a final combined similarity as: This function, will always yield a value between . . With values close to 0 indicating a low degree of resemblance, and values near 1 for pairs of malware which share almost all of their code. ## 8.1.4 Case studies8.1.4 Case studies8.1.4 Case studies8.1.4 Case studies8.1.4 Case studies We will now give the distance matrices for different families of malware, the values are percentages indicating how close the malware was found to be to each other. Here we will follow the procedure introduced in the previous section. **_Mimail_** The Mimail family exhibits high indexes of resemblance, as can be seen in the following distance matrices. **_Klez_** The Klez family comparison also yields good results. The classification splits the samples in two branches, one containing Klez variants A, B, C and D, and a second one with E, F, G, H, I and J, see Figure 4. ----- **_Netsky_** The large family of Netsky appears clustered together as well. A feature worth mentioning is the closeness to the Sasser family, which was created by the same author. Such resemblance yields high indexes of similarity with our algorithm, given the fact that the author used common code in both families. See Figure 4. **_Roron_** The Roron family is tightly clustered together given the values in distance matrix generated from the graphs. See Figure 4. **_Bagle_** The Bagle worms provide an example where the similarity between certain variants is low. The reason for such results is that some of the variants were just dropping a backdoor and did lack their own replication system, actually rendering them quite different form each other. The algorithm successfully indicates such fact. **_Sasser_** Sasser is another sample of highly similar malware. **_Random selections_** We will now show some distance matrices of randomly picked samples. We have previously shown families of malware, which do not demonstrate well the case where samples are different from each other. The following data shows that unrelated samples yield low similarity indexes. ## 8.2 Possibilities, exploiting the acquir8.2 Possibilities, exploiting the acquir8.2 Possibilities, exploiting the acquired8.2 Possibilities, exploiting the acquir8.2 Possibilities, exploiting the acquiredededed knowledgeknowledgeknowledgeknowledgeknowledge We will now discuss several techniques which benefit from the foundations laid in section 8.1 of the paper. We will briefly present several areas whose overall quality improves by having some metric defining the degree of similarity between malware. This also relates to which parts of two specific binaries correlate, proving similar or the same functionality. Some of the techniques discussed in this section have already been implemented and preliminary results are available; others, although briefly tested, lack a solid prototype. ## 8.2.1 Clustering and classification8.2.1 Clustering and classification8.2.1 Clustering and classification8.2.1 Clustering and classification8.2.1 Clustering and classification Once we created a reliable method to find approximate distances, or similarity indexes between binaries, we can move into such area as classification. Philippe Biondi demonstrated to one of the authors the results of applying clustering algorithms, such as X-tree, to a previously generated distance matrix of malware. Those encouraging results fostered further research of the topic. ----- **_Figure 4. Phylogenetic classification of a group malware._** ----- Recently Stephanie Wehner has done some work on classification using techniques based on Kolmogorov _Complexity. Her preliminary results [11] show_ classifications of worms into families without any underlying analysis performed on the samples. This approach is highly interesting when a fast classification is needed. Our approach is based on phylogenetics. An extensive literature exists on that area on classification of taxa and _cluster analysis. A series of algorithms exist which can_ be applied to distance matrices, that is, a square matrix providing numbers expressing how similar two OTU [12] or malware are to each other. Algorithms such as Neighbor Joining [13] applied to such matrices yield a hierarchical classification, which opens the door to taxonomies of malware. New samples can immediately be classified if a sample fulfilling certain requirements is provided [14]. In the future, given a common algorithm agreed beforehand, this procedure may aid in the classification task and diminish the confusion caused by randomly named malware. The existing approach in biological sciences might prove of value, where popular taxa have colloquial names, besides the purely scientific classification. Previous work [15] has already explored ways the phylogenetic approach may help in computer virus classification, although the metric used in that case was byte sequences present in the malware file. We believe that approach is limited, as even simple polymorphic malware may result in no possibility of application of the technique, let alone packed/encrypted code. While our approach currently faces similar restrictions if aiming for completely automated analysis, we already laid a base which yields acceptable results with a strong metric such as the one provided by graph isomorphisms, see Figure 4. ## 8.2.2 Helping quick analysis8.2.2 Helping quick analysis8.2.2 Helping quick analysis8.2.2 Helping quick analysis8.2.2 Helping quick analysis The knowledge of which functions in two different variants of malware are the same allows us to mine this information, aiding in rapid analysis of threats of the same family. Starting from an already analysed binary, the possibility exists (although it is not implemented yet) to mark certain areas on it, i.e. string references, registry keys, port numbers; and track whether or not they have been modified from one variant to the next. If such modification took place, we could obtain the new values and automatically generate the corresponding information in form of a description for the general public. ## 8.2.3 Migrating analysis information8.2.3 Migrating analysis information8.2.3 Migrating analysis information8.2.3 Migrating analysis information8.2.3 Migrating analysis information between code versionsbetween code versionsbetween code versionsbetween code versionsbetween code versions Porting existing knowledge from the analysis of a variant to a recently discovered one has proved to be a practical and valuable tool. The knowledge of which functions map from a binary A to a binary B allows us to rename all the known code in the, yet to be analysed, new variant. In some cases it allows us to start an analysis with 80–90 per cent of the defined functions already meaningfully named. This reduces the time needed for an initial overview and enables us to focus on more important parts of the code and algorithms and not just reanalysing previously seen code. ## 8.3 Other aspects of code visualisation8.3 Other aspects of code visualisation8.3 Other aspects of code visualisation8.3 Other aspects of code visualisation8.3 Other aspects of code visualisation using graphsusing graphsusing graphsusing graphsusing graphs Recently we created a tool which allows us to browse a disassembly by just looking, scrolling and zooming a graph. The main interface shows the whole defined call tree of the binary, every function node containing a list of the strings referenced from within it. Clicking in any of the functions opens a new window, which shows its CFG (Control Flow Graph) and in each basic block node, a list of function and string references. Placing the mouse over any of the basic blocks shows the corresponding disassembled code. More features have been added, i.e.: - Rename functions - Search functions in the graph - Change visualizations modes (hierarchical or radial) - Compare different versions of binaries (common code is represented in red) - Export the renamed functions as an IDC script to rename them in the IDA DB. The tool has allowed us, so far, to perform quick general analysis of a binary. The holistic perspective provided is really convenient when handling binaries with several layers of abstraction, where the API calls may lay some levels away from the investigated function. In a disassembler such as IDA, perceiving the logical position and purpose of the function within the binary might be nearly impossible without jumping to all the called code. In a graph, that becomes trivial, as all those relations are patent in one sight. ## 9 CONCLUSIONS AND FUTURE9 CONCLUSIONS AND FUTURE9 CONCLUSIONS AND FUTURE9 CONCLUSIONS AND FUTURE9 CONCLUSIONS AND FUTURE DIRECTIONSDIRECTIONSDIRECTIONSDIRECTIONSDIRECTIONS 9.1 Limitations9.1 Limitations9.1 Limitations9.1 Limitations9.1 Limitations Several limitations exist to a fully automated graph generation, comparison and classification approach. Malware is commonly packed and/or encrypted. That represents one major inconvenience which must be overcome before achieving our final goal of full automation. The graph creation and comparison algorithms are architecture-independent. The only requisite is the ----- existence of such concept as calls (branching is acceptable as long as IDA can recognize the functions). The binaries must use API services provided by the operating system, and have a significant quantity of code. ## 9.2 Futur9.2 Futur9.2 Future dir9.2 Futur9.2 Future dire dire directionse directionsectionsectionsections Further research in the classification of malware promises to yield rewarding results. Some of the basic concepts discussed in this paper should be extended. ## 9.2.1 Graph database9.2.1 Graph database9.2.1 Graph database9.2.1 Graph database9.2.1 Graph database The creation of a graph database has not yet been explored. In such a repository, the graph representation and properties of a binary would be stored. Upon arrival of newly discovered malware, the database would be queried for certain patterns. If the search turns out positive results, comparisons and posterior classification would be performed automatically. ## REFERENCES & NOTESREFERENCES & NOTESREFERENCES & NOTESREFERENCES & NOTESREFERENCES & NOTES [1] Datarescue IDA page, http://www.datarescue.com/ idabase/index.htm. [2] Python website, http://www.python.org/. [3] Guido van Rossum and Fred L. Drake, Jr.: Python _Tutorial, http://docs.python.org/tut/tut.html._ [4] Psyco website, http://psyco.sourceforge.net/. [5] SWIG website, http://www.swig.org/. [6] Halvar Flake, ‘Graph-Based Binary Analysis’, _Blackhat Briefings 2002._ [7] We also created a python module, namely pydot, to quickly generate and interact with appealing visual layouts of graphs through AT&T’s renowned GraphViz software. The python module is publicly available at one of the author’s personal website, http://dkbza.org/pydot.html. [8] _Phylogenetics: “Phylogenetics is the taxonomical_ classification of organisms based on how closely they are related in terms of evolutionary differences.” [9] By ‘atomic function’ we mean a function which performs no further calls and has a common name across different executables. In such way that it can be considered to be a basic building block. In graph theory they are the leaf vertices with indegree 1 and outdegree 0. [10] A signature for a function f is created by inspecting the list of atomic functions, La, and creating a binary string, containing a 1 in position n if f calls the atomic function in the index n of La and 0 otherwise. [11] S. Wehner, ‘Analyzing Worms using Compression’, http://homepages.cwi.nl/~wehner/worms/ [12] In phylogenetics, the term OTU stands for Operational Taxonomical Units and refers to members of different species to be classified given their evolutionary differences. [13] N. Saitou and M. Nei, ‘The neighbor-joining method: a new method for reconstructing phylogenetic trees’, Mol. Biol. Evol. 4, pp406–425. [14] The sample must be unpacked/unencrypted and have a properly reconstructed import table. [15] L.A. Goldberg, P.W. Goldberg, C.A. Phillips, G. Sorkin, ‘Constructing Computer Virus Phylogenies’, Journal of Algorithms, Vol 26(1) (Jan. 98) pp.188–208. -----