## Automatically Classifying Unknown Bots 
 by The REGISTER Messages

###### Ya Liu liuya@360.cn Network Security Research Lab, Qihoo 360


-----

### Outline

#### • Polymorphism and malware classification
 • C&C protocol based classification
 • REGISTER message based classification
 • Evaluation
 • Pitfalls
 • Conclusions


-----

### Polymorphic Malwares

######  A great many of new samples are captured

##### every day

*Malware samples captured in Quarter 1 of 2015

######  Most of them are polymorphic variants of known

##### malwares


-----

### Malware Classification

##### • The aim is to classify large number of samples into a
 relative small number of families

###### – e.g., zbot, darkshell, gh0st, ...

##### • Static sample signatures are heavily used by anti-virus
 products to build virus signature databases

###### – e.g., size, strings, binary code snippets

##### • It has FP/FN issues when dealing with modern
 polymorphic malwares


-----

#### C&C Protocol based Classification

 • Most of modern malwares are distributed to
 build botnets
 • It’s proved effective to classify botnets
 /malware based on their C&C protocols

##### – message types/formats/interactions are used

#### • Detailed C&C protocol specification is a pre-
 condition

##### – Manual RE is necessary in most cases

#### • Scalability issue


-----

### The REGISTER Message

#### • The first message exchanged in a C&C session,
 which MUST be sent by the bot

##### – It’s also called login, hello, call-home

#### • Its main usage is to tell the controller:

##### – the bot’s machine configs, e.g., OS version, CPU,
 memory size, net speed – hardcoded info copied from sample for verifying

#### • Many known botnets use this message in their
 protocols


-----

|bot name|supported OS|OS info in REGISTER?|CPU info in REGISTER|memory info in REGISTER|
|---|---|---|---|---|
|darkshell|win|yes|yes|yes|
|elknot|linux/win|yes|yes|yes|
|XOR DDoS|linux|yes|yes|no|
|chinaz|linux/win|yes|yes|yes|
|mayday|linux|yes|yes|no|
|dofloo|linux/win|yes|yes|yes|


### Common DDoS Bots’ REGISTERs

**OS info** **CPU info** **memory info**

###### supported  bot name in  in in 
 OS

**REGISTER?** **REGISTER** **REGISTER**

###### darkshell win yes yes yes

 elknot linux/win yes yes yes

 XOR DDoS linux yes yes no

 chinaz linux/win yes yes yes

 mayday linux yes yes no

 dofloo linux/win yes yes yes


-----

### Elknot’s REGISTER

#### • This bot is also called Billgates
 • It has variable length and binary format

struct register_msg {
msg_hdr hdr;
u8 conf[0x40];
std::string description;
u32 cpu_num;
u32 cpu_spd;
u32 mem_size;
std::string os;
std::string magic;
};


-----

### Dofloo’s REGISTER

##### • It’s also called Mr. Black
 • It has text format of “VERSONEX:%s|%d|%d|%s”

###### – VERSONEX:Linux-3.11.0-12-generic|2|3576
 MHz|2016MB|634MB|Hacker – VERSONEX:Windows XP|1|3582|Mr.Black – VERSONEX:Windows XP|1|3582
 MHz|1024MB|245MB|Hacker


-----

### Shannon Entropy

#### • “Entropy is a measure of unpredictability of
 information content. “

##### – From en.wikipedia.org

#### • Shannon entropy can be used to measure how
 statistically similar 2 messages are


-----

|Family name|Length|Entropy|
|---|---|---|
|Kelihos|164|4.6~4.8|
|XOR DDoS|272|3.22~3.29|
|mayday|401|0.4~0.6|
|elknot|variable|2.5~2.8|


### Sample REGISTER Statistics

###### Family name Length Entropy

 Kelihos 164 4.6~4.8

 XOR DDoS 272 3.22~3.29

 mayday 401 0.4~0.6

 elknot variable 2.5~2.8


-----

### Classification based on REGISTER

#### • Rich information included in REGISTER
 messages

##### – length, entropy value, format, semantics fields

#### • A new classification that is based on the
 similarities among REGISTERs in statistics/structure
 • It is scalable because the REGISTER message is
 easy to get


-----

### Objectives

#### • To classify unlabeled samples based on their
 REGISTER messages

##### – Simplify the sample analysis work

#### • What we really need to do is to find out the
 number of REGISTER families, and generate signatures for later identification


-----

### What We not do

#### • Will not tell you which cluster of REGISTERs are
 malicious, and which are not
 • Will not classify HTTP based REGISTERs

##### – Good solution exists – there is so much classification info (e.g., method,
 uri, headers) that we think it’s better to classify them in a separate solution


-----

### The Architecture


-----

### REGISTER Profiling

#### • Creating REGISTERs from network traces 

##### – Mainly parsing PCAP files

#### • Setting REGISTER attributes for later clustering
 and signature generating

##### – Length, entropy, binary/text format, semantic
 strings


-----

### Sample Profiles

{ {
"bin":1, “bin": 1,
"length": 260, "length": 127,
"entropy": 0.703393, "entropy": 2.949660 ,
"strings": [ "strings": [
{ {
"offset":4, "offset": 55,
"size":64, "size": 18,
"content":"Windows XP", "content": "08:00:27:6D:C8:C5",
"semantics": ”os” "semantics": "mac"
}, },
{ {
"offset":68, "offset": 73,
"size":128, "size": 14,
"content":"1 * 3187MHz", "content": "Ubuntu 13.10 ",
"semantics": “cpu” "semantics": "os"
}, },
{ {
"offset":196, "offset": 87,
"size":32, "size": 40,
"content":"128MB", "content": "Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz",
"semantics": “memory” "semantics": "cpu"
} }
] ]
} }


-----

### Coarse-grained Clustering

#### • To group statistically similar REGISTERs

##### – k-means algorithm is used to cluster vectors of
 <length, entropy>

#### • To reduce the computation cost

##### – A O(N[2]) computation cost is needed if we attempt
 to directly find out structurally similar REGISTERs


-----

### Finding Semantic Strings

#### • A heuristic deduction procedure

##### – OS: “linux”, “Ubuntu 13.10”,”Win XP” – Memory: “2016MB”, “2016 MB” – CPU: “Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz”,
 “MHz: 3576,3576”, “3582 MHz”

#### • Every semantic string has following attributes:

##### – semantics – offset – size


-----

### Length Field

#### • 3 types of semantics

##### – len_value=len3 – len_value=len2+len3 – len_value=len1+len2+len3

#### • Field size

##### – 32-bit/16-bit/8-bit

#### • Byte order

##### – Host-byte-order or network-byte-order


-----

### Fine-grained Clustering

#### • To find out structurally similar REGISTERs
 • 2 REGISTERs are considered as structurally
 similar if and only if:

##### 1. Having similar entropy values 2. Sharing the same set of semantics strings and their
 placing order  3. Sharing the same format of length field 4. Sharing the same encoding format

###### • binary or text
##### 5. Having similar length


-----

### Sample Signatures

{
"name":"L172O0S11T1448427268.607769",
"ordinal":[0.5, 1.0],
"type":"normal",
"length":172,
"entropy":3.48832,
“patterns":
[
{"type":"rawbytes", "offset":0, "length":11, "content":"31 36 38 00 6C 6C 7C 27 7C 27 7C"}
]
}

{
"name":"L126O4S16T1448427256.926312",
"ordinal":[0.5, 1.0],
"type":"normal",
"length":126,
"entropy":2.74977,
“patterns":
[
{"type":"rawbytes", "offset":4, "length":16, "content":"76 65 72 73 69 6F 6E 00 00 00 00 00 66 00 00 00"}
]
}


-----

### Signature Generation

#### • For each group of structurally similar
 REGISTERs a set of signatures are generated
 • Generation steps includes:

##### 1. Finding out frequent items of (offset, byte_value) 2. Merging offset continuing items 3. Normalizing them into valid signatures

#### • Some policies:

##### – AT_LEAST_OCCURS, default is 3 – AT_LEAST_SIG_BYTES, default is 4


-----

### Apriori/FP-Growth and Sig Bytes

#### • “Apriori is an algorithm for frequent item set
 mining and association rule learning over transactional databases.”

##### – From en.wikipedia.org

#### • Currently we use Apriori to find the frequent
 items of (offset, byte_value) among REGISTERs
 • We will update our solution to FP-Growth for
 better performance


-----

### Signature Types

#### • Normal

##### – specific byte patterns exist at specified offsets

#### • PCRE : 

##### – replacing semantic patterns with equivilant PCRE
 expressions, e.g., “Windows\s.*”,

#### • Entropy: 

##### – No valid patterns could be generated – AND all REGISTERs have the same length and very
 close entropy values


-----

### Evaluation

#### • Our system is implemented in C++ and
 python

##### – About 2,500 lines of C++ code.

#### • It takes less than 30 minutes to classify 10K
 REGISTERs

##### – Performed on a 4-core Intel(R) Core(TM) i7-4790
 CPU @ 3.60GHz machine with 4GB of RAM – Single thread


-----

### Choice of k

#### • k=20 is the best choice for k-means clustering
 when doing coarse-grained clustering

###### matched

25000

20000

15000

10000

5000

0

0 50 100 150 200 250 300 350 400 450matched


-----

### False Negatives/ False Positives


-----

### Generated Signature: STUN

{
"name":"L28O0S4O20S7T1447301150.241028",
"ordinal":[0.5, 1.0],
"sigtype":"normal",
"length":28,
"entropy":2.79622,
"blocks":
[
{"type":"rawbytes", "offset":0, "length":4, "content":"00 01 00 08"},
{"type":"rawbytes", "offset":20, "length":7, "content":"00 03 00 04 00 00 00"}
]
}


-----

### Generated Signature: SSL 

{
“name”:”L45O0S29T1447301147.172219”,
“ordinal”:[0.5, 1.0],
“type”:”normal”,
“length”:45,
“entropy”:2.93596,
“blocks”:
[
{“type”:”rawbytes”, “offset”:0, “length”:29, “content”:”80 2B 01 00 02 00 12 00 00 00 10 01 00 80 07 00 C0 03 00 80 06 00
40 02 00 80 04 00 80”}
]
}


-----

### Generated Signature: Bladabindi

{
"name":"L158O0S7O31S1O43S4O51S1O66S1O72S1O81S1O85S1O103S1O134S24T1447301149.680667",
"ordinal":[0.5, 1.0],
"type":"normal",
"length":158,
"entropy":3.3299,
"blocks":
[
{"type":"rawbytes", "offset":0, "length":7, "content":"6C 76 7C 27 7C 27 7C"},
{"type":"rawbytes", "offset":31, "length":1, "content":"7C"},
{"type":"rawbytes", "offset":43, "length":4, "content":"42 4F 4F 4D"},
{"type":"rawbytes", "offset":51, "length":1, "content":"7C"},
{"type":"rawbytes", "offset":66, "length":1, "content":"7C"},
{"type":"rawbytes", "offset":72, "length":1, "content":"30"},
{"type":"rawbytes", "offset":81, "length":1, "content":"7C"},
{"type":"rawbytes", "offset":85, "length":1, "content":"7C"},
{"type":"rawbytes", "offset":103, "length":1, "content":"73"},
{"type":"rawbytes", "offset":134, "length":24, "content":"7C 27 7C 27 7C 2E 2E 7C 27 7C 27 7C 7C 27 7C 27 7C 5B 65
6E 64 6F 66 5D"}
]
}


-----

### Generated Signature: Nitol

{
"name":"LXO1S18T1448629222.519142",
"ordinal":[0.5, 1.0],
"type":"normal",
"entropy":1.2787,
"blocks":
[
{"type":"rawbytes", "offset":1, "length":18, "content":"00 00 00 77 00 00 00 09 04 00 00 57 69 6E 20 58 50 20"}
]
}


-----

### Pitfalls

#### • REGISTER is not always used in C&C protocols
 • For UDP based C&C protocol, it’s hard to tell
 which message is REGISTER because of its statelessness nature
 • The same REGISTER may be shared across
 different C&C protocols
 • Our solution is not good at classifying
 variable-length text format REGISTERs


-----

### Conclusions

#### • Statistical/structural similarities can be used to
 effectively classify REGISTERs
 • REGISTER based classification can
 complement C&C protocol based classification
 • Our solution is good at classifying binary
 format REGISTERs with fixed lengths


-----

# Q&A


-----