## Automatically Classifying Unknown Bots by The REGISTER Messages ###### Ya Liu liuya@360.cn Network Security Research Lab, Qihoo 360 ----- ### Outline #### • Polymorphism and malware classification • C&C protocol based classification • REGISTER message based classification • Evaluation • Pitfalls • Conclusions ----- ### Polymorphic Malwares ######  A great many of new samples are captured ##### every day *Malware samples captured in Quarter 1 of 2015 ######  Most of them are polymorphic variants of known ##### malwares ----- ### Malware Classification ##### • The aim is to classify large number of samples into a relative small number of families ###### – e.g., zbot, darkshell, gh0st, ... ##### • Static sample signatures are heavily used by anti-virus products to build virus signature databases ###### – e.g., size, strings, binary code snippets ##### • It has FP/FN issues when dealing with modern polymorphic malwares ----- #### C&C Protocol based Classification • Most of modern malwares are distributed to build botnets • It’s proved effective to classify botnets /malware based on their C&C protocols ##### – message types/formats/interactions are used #### • Detailed C&C protocol specification is a pre- condition ##### – Manual RE is necessary in most cases #### • Scalability issue ----- ### The REGISTER Message #### • The first message exchanged in a C&C session, which MUST be sent by the bot ##### – It’s also called login, hello, call-home #### • Its main usage is to tell the controller: ##### – the bot’s machine configs, e.g., OS version, CPU, memory size, net speed – hardcoded info copied from sample for verifying #### • Many known botnets use this message in their protocols ----- |bot name|supported OS|OS info in REGISTER?|CPU info in REGISTER|memory info in REGISTER| |---|---|---|---|---| |darkshell|win|yes|yes|yes| |elknot|linux/win|yes|yes|yes| |XOR DDoS|linux|yes|yes|no| |chinaz|linux/win|yes|yes|yes| |mayday|linux|yes|yes|no| |dofloo|linux/win|yes|yes|yes| ### Common DDoS Bots’ REGISTERs **OS info** **CPU info** **memory info** ###### supported bot name in in in OS **REGISTER?** **REGISTER** **REGISTER** ###### darkshell win yes yes yes elknot linux/win yes yes yes XOR DDoS linux yes yes no chinaz linux/win yes yes yes mayday linux yes yes no dofloo linux/win yes yes yes ----- ### Elknot’s REGISTER #### • This bot is also called Billgates • It has variable length and binary format struct register_msg { msg_hdr hdr; u8 conf[0x40]; std::string description; u32 cpu_num; u32 cpu_spd; u32 mem_size; std::string os; std::string magic; }; ----- ### Dofloo’s REGISTER ##### • It’s also called Mr. Black • It has text format of “VERSONEX:%s|%d|%d|%s” ###### – VERSONEX:Linux-3.11.0-12-generic|2|3576 MHz|2016MB|634MB|Hacker – VERSONEX:Windows XP|1|3582|Mr.Black – VERSONEX:Windows XP|1|3582 MHz|1024MB|245MB|Hacker ----- ### Shannon Entropy #### • “Entropy is a measure of unpredictability of information content. “ ##### – From en.wikipedia.org #### • Shannon entropy can be used to measure how statistically similar 2 messages are ----- |Family name|Length|Entropy| |---|---|---| |Kelihos|164|4.6~4.8| |XOR DDoS|272|3.22~3.29| |mayday|401|0.4~0.6| |elknot|variable|2.5~2.8| ### Sample REGISTER Statistics ###### Family name Length Entropy Kelihos 164 4.6~4.8 XOR DDoS 272 3.22~3.29 mayday 401 0.4~0.6 elknot variable 2.5~2.8 ----- ### Classification based on REGISTER #### • Rich information included in REGISTER messages ##### – length, entropy value, format, semantics fields #### • A new classification that is based on the similarities among REGISTERs in statistics/structure • It is scalable because the REGISTER message is easy to get ----- ### Objectives #### • To classify unlabeled samples based on their REGISTER messages ##### – Simplify the sample analysis work #### • What we really need to do is to find out the number of REGISTER families, and generate signatures for later identification ----- ### What We not do #### • Will not tell you which cluster of REGISTERs are malicious, and which are not • Will not classify HTTP based REGISTERs ##### – Good solution exists – there is so much classification info (e.g., method, uri, headers) that we think it’s better to classify them in a separate solution ----- ### The Architecture ----- ### REGISTER Profiling #### • Creating REGISTERs from network traces ##### – Mainly parsing PCAP files #### • Setting REGISTER attributes for later clustering and signature generating ##### – Length, entropy, binary/text format, semantic strings ----- ### Sample Profiles { { "bin":1, “bin": 1, "length": 260, "length": 127, "entropy": 0.703393, "entropy": 2.949660 , "strings": [ "strings": [ { { "offset":4, "offset": 55, "size":64, "size": 18, "content":"Windows XP", "content": "08:00:27:6D:C8:C5", "semantics": ”os” "semantics": "mac" }, }, { { "offset":68, "offset": 73, "size":128, "size": 14, "content":"1 * 3187MHz", "content": "Ubuntu 13.10 ", "semantics": “cpu” "semantics": "os" }, }, { { "offset":196, "offset": 87, "size":32, "size": 40, "content":"128MB", "content": "Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz", "semantics": “memory” "semantics": "cpu" } } ] ] } } ----- ### Coarse-grained Clustering #### • To group statistically similar REGISTERs ##### – k-means algorithm is used to cluster vectors of #### • To reduce the computation cost ##### – A O(N[2]) computation cost is needed if we attempt to directly find out structurally similar REGISTERs ----- ### Finding Semantic Strings #### • A heuristic deduction procedure ##### – OS: “linux”, “Ubuntu 13.10”,”Win XP” – Memory: “2016MB”, “2016 MB” – CPU: “Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz”, “MHz: 3576,3576”, “3582 MHz” #### • Every semantic string has following attributes: ##### – semantics – offset – size ----- ### Length Field #### • 3 types of semantics ##### – len_value=len3 – len_value=len2+len3 – len_value=len1+len2+len3 #### • Field size ##### – 32-bit/16-bit/8-bit #### • Byte order ##### – Host-byte-order or network-byte-order ----- ### Fine-grained Clustering #### • To find out structurally similar REGISTERs • 2 REGISTERs are considered as structurally similar if and only if: ##### 1. Having similar entropy values 2. Sharing the same set of semantics strings and their placing order 3. Sharing the same format of length field 4. Sharing the same encoding format ###### • binary or text ##### 5. Having similar length ----- ### Sample Signatures { "name":"L172O0S11T1448427268.607769", "ordinal":[0.5, 1.0], "type":"normal", "length":172, "entropy":3.48832, “patterns": [ {"type":"rawbytes", "offset":0, "length":11, "content":"31 36 38 00 6C 6C 7C 27 7C 27 7C"} ] } { "name":"L126O4S16T1448427256.926312", "ordinal":[0.5, 1.0], "type":"normal", "length":126, "entropy":2.74977, “patterns": [ {"type":"rawbytes", "offset":4, "length":16, "content":"76 65 72 73 69 6F 6E 00 00 00 00 00 66 00 00 00"} ] } ----- ### Signature Generation #### • For each group of structurally similar REGISTERs a set of signatures are generated • Generation steps includes: ##### 1. Finding out frequent items of (offset, byte_value) 2. Merging offset continuing items 3. Normalizing them into valid signatures #### • Some policies: ##### – AT_LEAST_OCCURS, default is 3 – AT_LEAST_SIG_BYTES, default is 4 ----- ### Apriori/FP-Growth and Sig Bytes #### • “Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases.” ##### – From en.wikipedia.org #### • Currently we use Apriori to find the frequent items of (offset, byte_value) among REGISTERs • We will update our solution to FP-Growth for better performance ----- ### Signature Types #### • Normal ##### – specific byte patterns exist at specified offsets #### • PCRE : ##### – replacing semantic patterns with equivilant PCRE expressions, e.g., “Windows\s.*”, #### • Entropy: ##### – No valid patterns could be generated – AND all REGISTERs have the same length and very close entropy values ----- ### Evaluation #### • Our system is implemented in C++ and python ##### – About 2,500 lines of C++ code. #### • It takes less than 30 minutes to classify 10K REGISTERs ##### – Performed on a 4-core Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz machine with 4GB of RAM – Single thread ----- ### Choice of k #### • k=20 is the best choice for k-means clustering when doing coarse-grained clustering ###### matched 25000 20000 15000 10000 5000 0 0 50 100 150 200 250 300 350 400 450matched ----- ### False Negatives/ False Positives ----- ### Generated Signature: STUN { "name":"L28O0S4O20S7T1447301150.241028", "ordinal":[0.5, 1.0], "sigtype":"normal", "length":28, "entropy":2.79622, "blocks": [ {"type":"rawbytes", "offset":0, "length":4, "content":"00 01 00 08"}, {"type":"rawbytes", "offset":20, "length":7, "content":"00 03 00 04 00 00 00"} ] } ----- ### Generated Signature: SSL { “name”:”L45O0S29T1447301147.172219”, “ordinal”:[0.5, 1.0], “type”:”normal”, “length”:45, “entropy”:2.93596, “blocks”: [ {“type”:”rawbytes”, “offset”:0, “length”:29, “content”:”80 2B 01 00 02 00 12 00 00 00 10 01 00 80 07 00 C0 03 00 80 06 00 40 02 00 80 04 00 80”} ] } ----- ### Generated Signature: Bladabindi { "name":"L158O0S7O31S1O43S4O51S1O66S1O72S1O81S1O85S1O103S1O134S24T1447301149.680667", "ordinal":[0.5, 1.0], "type":"normal", "length":158, "entropy":3.3299, "blocks": [ {"type":"rawbytes", "offset":0, "length":7, "content":"6C 76 7C 27 7C 27 7C"}, {"type":"rawbytes", "offset":31, "length":1, "content":"7C"}, {"type":"rawbytes", "offset":43, "length":4, "content":"42 4F 4F 4D"}, {"type":"rawbytes", "offset":51, "length":1, "content":"7C"}, {"type":"rawbytes", "offset":66, "length":1, "content":"7C"}, {"type":"rawbytes", "offset":72, "length":1, "content":"30"}, {"type":"rawbytes", "offset":81, "length":1, "content":"7C"}, {"type":"rawbytes", "offset":85, "length":1, "content":"7C"}, {"type":"rawbytes", "offset":103, "length":1, "content":"73"}, {"type":"rawbytes", "offset":134, "length":24, "content":"7C 27 7C 27 7C 2E 2E 7C 27 7C 27 7C 7C 27 7C 27 7C 5B 65 6E 64 6F 66 5D"} ] } ----- ### Generated Signature: Nitol { "name":"LXO1S18T1448629222.519142", "ordinal":[0.5, 1.0], "type":"normal", "entropy":1.2787, "blocks": [ {"type":"rawbytes", "offset":1, "length":18, "content":"00 00 00 77 00 00 00 09 04 00 00 57 69 6E 20 58 50 20"} ] } ----- ### Pitfalls #### • REGISTER is not always used in C&C protocols • For UDP based C&C protocol, it’s hard to tell which message is REGISTER because of its statelessness nature • The same REGISTER may be shared across different C&C protocols • Our solution is not good at classifying variable-length text format REGISTERs ----- ### Conclusions #### • Statistical/structural similarities can be used to effectively classify REGISTERs • REGISTER based classification can complement C&C protocol based classification • Our solution is good at classifying binary format REGISTERs with fixed lengths ----- # Q&A -----