[See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/320250366](https://www.researchgate.net/publication/320250366_Botnet_protocol_inference_in_the_presence_of_encrypted_traffic?enrichId=rgreq-2041a270478a3e077c78db92bc7d8875-XXX&enrichSource=Y292ZXJQYWdlOzMyMDI1MDM2NjtBUzo5NTU5ODQyNTE5OTgyMDlAMTYwNDkzNTgxNTU0NA%3D%3D&el=1_x_2&_esc=publicationCoverPdf)

### Botnet protocol inference in the presence of encrypted traffic

**Conference Paper · May 2017**

DOI: 10.1109/INFOCOM.2017.8057064


CITATIONS
17

**5 authors, including:**

[Lorenzo De Carli](https://www.researchgate.net/profile/Lorenzo-De-Carli?enrichId=rgreq-2041a270478a3e077c78db92bc7d8875-XXX&enrichSource=Y292ZXJQYWdlOzMyMDI1MDM2NjtBUzo5NTU5ODQyNTE5OTgyMDlAMTYwNDkzNTgxNTU0NA%3D%3D&el=1_x_5&_esc=publicationCoverPdf)

[The University of Calgary](https://www.researchgate.net/institution/The_University_of_Calgary?enrichId=rgreq-2041a270478a3e077c78db92bc7d8875-XXX&enrichSource=Y292ZXJQYWdlOzMyMDI1MDM2NjtBUzo5NTU5ODQyNTE5OTgyMDlAMTYwNDkzNTgxNTU0NA%3D%3D&el=1_x_6&_esc=publicationCoverPdf)

**46** PUBLICATIONS **440** CITATIONS


READS
558


[Alok Tongaonkar](https://www.researchgate.net/profile/Alok-Tongaonkar?enrichId=rgreq-2041a270478a3e077c78db92bc7d8875-XXX&enrichSource=Y292ZXJQYWdlOzMyMDI1MDM2NjtBUzo5NTU5ODQyNTE5OTgyMDlAMTYwNDkzNTgxNTU0NA%3D%3D&el=1_x_5&_esc=publicationCoverPdf)

**21** PUBLICATIONS **539** CITATIONS


[SEE PROFILE](https://www.researchgate.net/profile/Alok-Tongaonkar?enrichId=rgreq-2041a270478a3e077c78db92bc7d8875-XXX&enrichSource=Y292ZXJQYWdlOzMyMDI1MDM2NjtBUzo5NTU5ODQyNTE5OTgyMDlAMTYwNDkzNTgxNTU0NA%3D%3D&el=1_x_7&_esc=publicationCoverPdf)


[SEE PROFILE](https://www.researchgate.net/profile/Lorenzo-De-Carli?enrichId=rgreq-2041a270478a3e077c78db92bc7d8875-XXX&enrichSource=Y292ZXJQYWdlOzMyMDI1MDM2NjtBUzo5NTU5ODQyNTE5OTgyMDlAMTYwNDkzNTgxNTU0NA%3D%3D&el=1_x_7&_esc=publicationCoverPdf)


[Somesh Jha](https://www.researchgate.net/profile/Somesh-Jha-3?enrichId=rgreq-2041a270478a3e077c78db92bc7d8875-XXX&enrichSource=Y292ZXJQYWdlOzMyMDI1MDM2NjtBUzo5NTU5ODQyNTE5OTgyMDlAMTYwNDkzNTgxNTU0NA%3D%3D&el=1_x_5&_esc=publicationCoverPdf)

[University of Wisconsin–Madison](https://www.researchgate.net/institution/University_of_Wisconsin-Madison?enrichId=rgreq-2041a270478a3e077c78db92bc7d8875-XXX&enrichSource=Y292ZXJQYWdlOzMyMDI1MDM2NjtBUzo5NTU5ODQyNTE5OTgyMDlAMTYwNDkzNTgxNTU0NA%3D%3D&el=1_x_6&_esc=publicationCoverPdf)

**298** PUBLICATIONS **32,738** CITATIONS


[SEE PROFILE](https://www.researchgate.net/profile/Somesh-Jha-3?enrichId=rgreq-2041a270478a3e077c78db92bc7d8875-XXX&enrichSource=Y292ZXJQYWdlOzMyMDI1MDM2NjtBUzo5NTU5ODQyNTE5OTgyMDlAMTYwNDkzNTgxNTU0NA%3D%3D&el=1_x_7&_esc=publicationCoverPdf)


-----

# Botnet Protocol Inference in the Presence of Encrypted Traffic[∗]

#### Lorenzo De Carli[1] Ruben Torres[2] Gaspar Modelo-Howard[2] Alok Tongaonkar[3] Somesh Jha[4]

1Colorado State University 2Symantec 3RedLock Inc. 4University of Wisconsin, Madison


**_Abstract—Network protocol reverse engineering of botnet_**
**command and control (C&C) is a challenging task, which**
**requires various manual steps and a significant amount of domain**
**knowledge. Furthermore, most of today’s C&C protocols are**
**encrypted, which prevents any analysis on the traffic without first**
**discovering the encryption algorithm and key. To address these**
**challenges, we present an end-to-end system for automatically**
**discovering the encryption algorithm and keys, generating a**
**protocol specification for the C&C traffic, and crafting effective**
**network signatures. In order to infer the encryption algorithm**
**and key, we enhance state-of-the-art techniques to extract this in-**
**formation using lightweight binary analysis. In order to generate**
**protocol specifications we infer field types purely by analyzing**
**network traffic. We evaluate our approach on three prominent**
**malware families: Sality, ZeroAccess and Ramnit. Our results**
**are encouraging: the approach decrypts all three protocols,**
**detects 97% of fields whose semantics are supported, and infers**
**specifications that correctly align with real protocol specifications.**

I. INTRODUCTION

Modern malware increasingly instantiates botnets: network
of compromised computers that perform a diverse set of malicious activities. Usually motivated by financial gain, botnets
incorporate a wide range of functionalities, such as launching
DDoS and spam campaigns, stealing personal information,
and committing e-commerce fraud. The large size of many
botnets [1], [2] make them an extremely effective attack tool.
A crucial aspect of botnets, given their distributed nature, is
network communication. Bots on infected machines routinely
“phone home” to botnet servers, receive instructions, and perform data exfiltration. Understanding the semantics of botnet
communications allows the analyst to quickly gain insight to
the malware’s mechanisms, and to generate network signatures
to detect malicious traffic. Understanding malware communications can also enable remediation actions and forensics.
In order to understand bot network behavior, analysts
use a variety of techniques such as manual traffic collection/inspection and sandboxed execution of malware. Increasing use of encrypted communication also imposes manual debugging of malware binaries to analyze and reverse encryption.
These procedures are time-consuming, error-prone and unable
to cope with the rates at which new malware appears. Limited,
manually-inferred protocol understanding may also result in
ineffective signatures, which are excessively specific to the
available traffic samples, and can be easily circumvented.
In this paper, we propose a novel technique for automatic
inference of malware network protocol specifications, given

_∗_ Supported in part by NSF grant CNS-1228782. Emails: ldecarli@colostate.edu, {ruben torresguerra, gaspar modelohoward}@symantec.com, alok@redlock.io, jha@cs.wisc.edu.


samples of a malware’s communications and the malware
binary. We focus on malware Command & Control (C&C)
protocols, used by botmasters to control malware-infected
hosts and to execute various malicious activities. Such communications tend to be based on custom binary formats [3],
making it possible to specialize the analysis to this domain.
Also, each C&C protocol is specific to its malware family—
providing the advantages of constituting a reliable fingerprint,
and giving insight into the malware structure and intent.
Reverse-engineering a malware protocol is extremely difficult: messages are oftentimes ambiguous, and may contain
errors or purposely-injected noise. Our approach casts it as a
_type inferencing problem [4]: we assume that each message_
consists of a sequence of binary fields, and we define a type
_system describing all possible field types. We then run a novel_
type inferencing algorithm to infer message structure.
Another significant issue is that—given the wide adoption of
encryption in C&C network traffic [3]—decryption is required
before inference can be accomplished. We therefore also
propose a system to extract C&C encryption keys by applying
dynamic analysis on the malware binary. Our approach extends
existing techniques (such as [5]), focusing their application on
uses of encryption involving network data.
Our approach can significantly alleviate the analysis burden
on human analysts, facilitating timely deployment of countermeasures against new malware. Also, a signature generation
step based on our specifications can leverage rich semantic
understanding, leading to signatures that are resilient to incremental changes in the C&C protocol. Finally, differently from
previous works we strive to provide a practical workflow for
the analyst, by: (i) allowing to leverage both traffic generated
in a controlled environment and samples captured in the wild,
and (ii) discovering rich field types that provide actionable
material (e.g. downloaded executables).
We evaluated our approach on three different malware
protocols. Results demonstrate that our prototype correctly
infers details of the encryption used in each protocol, enabling
decryption. Furthermore, it detects 97% of fields whose type
is supported by our approach, with negligible false positives.
Finally, an evaluation of signatures based on the generated
specification shows 98% sample coverage/no false positives.
**In summary, this paper contributes: (i) a technique**
_for rich malware protocol inference requiring only decrypted_
_network traffic, and (ii) a practical, scalable solution to infer_
_details of encryption used by malware in network protocols,_
_leveraging state-of-the art encryption analysis techniques._


-----

|Col1|Col2|
|---|---|
||B.3|

|B.1-B.3|Algo: RC4 Key: 0769F9C5...|
|---|---|

|B.4|Col2|
|---|---|
|||

|C.1|Col2|
|---|---|
|||

|C.2|Col2|
|---|---|
|||


**Malware**


**(a) System architecture**


**Sandboxed Execution: Traffic generation**

**A** E5AAC031EA6423733B5B741C499F39C1A627...

**Trace analysis (Encryption Profile)**

**B.1-B.3** **Algo: RC4 Key: 0769F9C5...**

**Encryption inference/Message decryption**

**B.4** FE3467CDF24535AC46746573000400004D5A...

**Message clustering by type (using detected type field)**

**C.1** FE3467CDF24535AC| Type |000400004D5A...

**Content field inference**
**C.2** **|Magic |F24535AC| Type |00040000| Exe...**

**Dependent field inference**

**C.3** |Magic | CRC  | Type |Exe len | Exe...

#### ...

**Specification abstraction**

Field 1: 4-byte Magic, Field 2: 4-byte CRC, Field 3: 4
**C.5**

byte Msg Type, Field 4: 4-byte Field 5 len, Field 5: EXE

**(b) Protocol decryption and inference example**


Fig. 1: Outline of the proposed system and example of operation


II. OVERVIEW


We decompose the problem of reverse-engineering C&C
protocols into: (1) C&C traffic decryption, using dynamic
analysis to extract encryption keys from malware executions,
and (2) automatic derivation of protocol specification via type
inference over decrypted C&C traffic. This section outlines
both steps, using as an example a message format from the
C&C protocol of the ZeroAccess malware [6] (details of each
step are then given in § III). Such message format is used by
ZeroAccess to download executable payloads to an infected
machine. It consists of a sequence of binary fields: (i) 4-byte
magic value that identifies the protocol, (ii) 4-byte message
CRC, (iii) 4-byte message type, (iv) 4-byte payload length, and
(v) payload (executable file). Each message is fully encrypted
using the RC4 cipher and a static key. The overall system
architecture is depicted in Fig. 1(a). Fig. 1(b) highlights the
output of each step using an example message.
_a) Encryption analysis: In step A, the malware of in-_
terest is executed in a sandbox which traces machine instructions, network system calls, and generated/received network
packets. The instruction trace is then processed by a sequence
of analysis steps. Step B.1 pre-selects candidate instruction
sequences that behave like encryption functions. Step B.2
performs a more detailed analysis of each candidate, either
mapping it to a specific known encryption cipher, or discarding
it. Each surviving candidate represents the dynamic execution
of an encryption function. Step B.3 matches the output of
each candidate to the input of network system calls, retaining
only candidates whose output is sent on the network (the
other candidates likely represent non-C&C related events—
e.g. encryption of local files). Step B.4 maps the retained
uses of encryption to the payload of C&C packets emitted
during sandbox execution. It then heuristically infers if the
encryption key is either static or derived from the payload: if
either condition is true, decryption of any malware messages
(not just sandbox-generated ones) becomes possible. At the


end of this step, all messages in the dataset get decrypted.
**Integrating additional samples: Once our approach has**
learned decryption information, it can decrypt any protocol
message. This enables the analyst to feed additional C&C messages from external sources (honeypots, etc.) into the protocol
analysis stage, potentially inferring a richer specification.
_b) Protocol analysis: Most C&C protocols define mul-_
tiple message types for different purposes. As each type has
a distinct structure, it is crucial that messages are clustered
by type, so each type can be analyzed separately. Step C.1
detects fields specifying the message type, and clusters messages based on the values of those fields. Step C.2 identifies
_content fields, i.e. fields whose structure is self-evident and_
_independent from the context (the rest of the message). The_
first and last field of the example message in Fig. 1(b) fall
in this category: the Magic field can be determined as it
holds a constant value across messages. The EXE file field is
also easily identifiable as PE executables have a recognizable
structure. We instantiated an initial set of type definitions based
on our experience with C&C protocols; this set can be easily
extended/customized for specific analysis tasks. Step C.3 determines dependent field types. These fields express properties
of other fields or of the message. Our example message
contains two: Message CRC and Payload Length (which in this
case expresses the length of the EXE file). Step C.4 detects
_composite field types, such as lists of content or dependent_
fields (not present in Fig. 1(b)). Finally, Step C.5 reconciles the
sequence of fields extracted from each individual message into
a single protocol specification using sequence alignment [7].

III. PROTOCOL INFERENCE
_A. Sandboxed execution_

The first component of our system investigates whether the
malware uses encryption algorithms. It begins by running the
malware in a controlled environment consisting of a virtual
machine augmented with program tracing (sandbox in the
following). This stage collects three different program traces:


-----

_• A system call trace S, capturing malware-triggered system_
calls used to send network data (e.g. send()). Each call is
represented as a set R[S] of memory locations and registers
used to pass network-bound data to the OS. S therefore
consists of a sequence of n such sets R1[S][...R]n[S][.]

_• An instruction trace T_, representing an ordered sequence
of m processor instructions executed by the malware.
Formally, T = t1...tm, where each instruction ti is a
tuple (i, a, R[T] _, W_ _[T]_ ). i is the instruction opcode, a the
instruction address in memory, R[T] the set of memory
locations/registers read by the instruction, and W _[T]_ the set
of memory locations/registers written by the instruction.

_• A network trace N of all packets generated by the sandbox._

_B. Encryption Analysis_

Once traces of the malware execution are available, we
analyze them to locate instances of encryption use.
_1) Candidate Detection:_ This step executes a candi_date detection procedure, which takes as input the instruction_
trace T and locates a set C of candidate encryption instances.
Each candidate in C is a sequence D ⊆ _T_, representing the
possible execution of an encryption primitive.
In order to detect encryption in instruction traces several
approaches could be used, such as [5], [8], [9]. In our
evaluation we use ALIGOT [5] as it is a recent approach with
publicly available and well-documented source code. ALIGOT
is based on the insight that loops are recurring structures in
cipher implementations, therefore searching execution traces
for chains of dynamic loops. We observed however that the
base ALIGOT algorithm does not scale to traces larger than a
few tens of thousands of instructions, which prevents a direct
application to our case. We concluded that the issue lies in
ALIGOT’s loop detection, which treats each new instruction as
the potential beginning of a new loop, reaching an impractical
memory footprint in long sections of straight-line code. After
determining that dynamic loops can be seen as maximal
_repetitions [10], we replace ALIGOT loop detection with_
our implementation of the Kolpakov/Kucherov algorithm [11],
which finds all maximal repetitions in a string in linear time.
Our optimization lifts the maximum practical trace size from
tens of thousands of instructions to a few millions.
_2) Candidate Matching:_ This step executes a candi_date matching procedure which receives as input the set of_
candidate encryption instances C from the previous step,
and determines whether each candidate corresponds to the
execution of an actual cipher. Formally, candidate matching
computes a mapping C → _K∪{⊥}, where K is a set of known_
ciphers, while ⊥ represents a failed match. Furthermore, the
procedure determines the set of inputs (keys, plaintexts) and
outputs (ciphertexts) for each instance. The output of the
procedure is a set of matched candidates M, where each
element in M is a tuple (D, k, R[M] _, W_ _[M]_ ). D ⊆ _T is the_
sequence of instructions in the candidate encryption instance;
_k ∈_ _K is the encryption cipher being used, R[M]_ is the set of
inputs passed to the cipher, and W _[M]_ its set of outputs.
It may be possible to implement candidate matching using


several possible approaches, such as [5], [12]. We again use the
approach proposed by ALIGOT for the same reasons discussed
previously. It works by determining all possible input and
output parameters to/from the candidate encryption instance,
and verifying if any input/output combination matches that of
a known cipher. In our implementation we deploy an optimization, by filtering parameters (such as pointers) unlikely
to constitute either key material, ciphertext or plaintext.
_3) Candidate Filtering: This step consists of a candi-_
_date filtering procedure which receives as input the set of_
matched candidates M emitted by the previous step, and the
system call trace S. candidate filtering eliminates elements in
_M whose output is not used to generate network messages._
Recall that each system call in S is described by a set R[S]

of memory locations and registers used to specify the data to
send on the network. A candidate m = (D, k, R[M] _, W_ _[M]_ ) is
retained only if ∃R[S] _∈_ _S, s.t. O = (W_ _[M]_ _∩_ _R[S]) ̸= ∅_ and
no other instruction writes the location(s) in O between the
execution of m and the execution of the system call R[S]. The
output is a filtered set of encrypted instances F, with F ⊆ _M_ .
We note that there is still potential for false positives
from network-bound encrypted data generated by the malware
process for non-C&C related purposes. These could be avoided
by discarding flows which match known protocols. We leave
this additional filtering to future work.
_4) Key Type Inference/Decryption: The next goal is to_
infer general information about how the malware performs
encryption. Malware use of ciphers is oftentimes closer to
obfuscation than proper encryption [3], employing symmetric
ciphers and keys that are either static or provided in the
messages. This implies that —once the encryption details have
been learned—it is possible to decrypt any C&C message
within the protocol, not only the ones observed in the sandbox.
We first determine where the outputs of encryption instances
in the set F appear in the sandbox-generated network packets
from the trace N, by matching the output of encryption instances to packets. We then apply the following two heuristics:
**_Key type inference: If the key is constant across all observed_**
_message samples, then the C&C protocol is assumed to use_
_the very same key for all possible messages. If the key appears_
_at a constant offset within observed message samples, then the_
_key is assumed to be located at that offset in every message._
**_Decryption: If the observed encryption instances cover full_**
_message payloads, the C&C protocol is assumed to fully_
_encrypt each message. If the observed encryption instances_
_only partially overlap with message payloads, it is assumed_
_that the malware encrypts only certain message fields._
After executing this step, the analyst can provide additional
message samples, in order to augment the dataset for protocol
analysis. If encryption was detected, all available messages get
decrypted. The only exception to this is when the malware
only encrypts certain fields. If so, the algorithm has not yet
accrued enough information to determine which parts of each
message may be encrypted (beyond the ones discovered during
trace analysis). In this case, decryption is performed later on
a per-field basis, as part of field inference (§ III-C2).


-----

|Rule target|Details|
|---|---|
|String|Detect sequences of printable characters.|
|IPv4|Match message payload to an IP blacklist.|
|Timestamp|Detect 4-byte POSIX timestamps (must describe a date within a week of when the message was sent).|
|PE|Locate Windows binaries with a PE parser.|
|ZIP|Locate ZIP files. Identify ZIP sections by magic number and merge contiguous sections.|
|Magic|Detect magic numbers. Looks for a constant prefix appearing in every message.|
|Padding|Detect message padding (an iterated constant se- quence trailing every message).|


TABLE I: Content-field detection rules

Formally, the output of this step is a set of decrypted message samples S = s1, ..., sn. (see Fig. 2(a) for an example).

_C. Protocol Analysis_

The goal of the next part is to analyze decrypted message
samples to infer a protocol specification P . For the purpose of
our work we define P as a set of n message types T1, ..., Tn.
Each Ti is in turn is a sequence of m field types t1, ..., tm.
_1) Message Clustering: This step determines a partition of_
the sample set S, associating each sample with exactly one
message type. We use the insight that most protocols include
one or more fields that explicitly specify the message type. We
therefore detect likely candidates for message type fields, and
cluster messages based on the value of those fields. In order to
detect type fields without any prior information about message
structure, we use a heuristics due to Bermudez et al. [13],
which works by measuring causality between candidate fields
in pairs of client request/server response messages (the values
of such fields are highly likely to be related). We then assign
each message to a type based on the value of its type fields.
_2) Field inference: After messages have been clustered by_
type, we proceed to perform field inference. The overall goal
is to determine a partition of each message sample s ∈ _S in_
a sequence of fields f1, ..., fn. Of course, for each field fi we
also want to obtain the corresponding field type ti.
Our approach is based on a taxonomy of binary protocol
fields in four broad categories. The first includes content
_fields, i.e. fields that carry various type of information (IP_
addresses, executables, etc.) between client and server. The
second category includes dependent fields, i.e. fields that describe properties of other fields, such as their offset or length.
The third category consists of encrypted fields, i.e. content and
dependent fields that are encrypted. The third category occurs
in C&C protocols that perform partial message encryption. We
also define composite fields, consisting of repeated sequences
of basic fields types (e.g. a list of IP addresses).
The core idea of our field inference approach is to detect
content fields using a rule-based technique. Once content
fields have been located, the algorithm searches for dependent
fields. Furthermore, if the protocol performs partial message
encryption, the algorithm analyzes the message looking for
signs of encrypted content, which is then decrypted to extract
additional fields. Finally, the algorithm analyzes the sequence
of all discovered fields to identify composite fields.


**Content-field inference. The assumption behind this pass is**
that the structure of many types of content fields is selfevident, i.e. it can be detected by a rule-based approach. For
example, consider the example message in Fig. 1(b) after
Step C.1. The EXE file field detected in Step C.2 has a selfevident structure: a well-defined format recognizable by a PE
file parser. We implemented a library of detection rules—
detailed in Table I—for various content types, based on our
experience with C&C protocols. In order to identify content
fields, the detector runs all available rules on message samples.
Rules receive a set of samples and output the detected fields,
with no constraints on how they are implemented. We found
adding rules to be straightforward: developing the ZIP detector
took us two hours with no previous familiarity with the format.
**Dependent-field inference. This step breaks each message**
segment not already covered by a content field in 1-, 2and 4-byte N -grams. It then checks if the value of each
_N_ -gram matches a property of an existing content field or
of the message. In the example of Fig. 1(b), Step C.3, two
dependent fields are detected: CRC and LENGTH. Table II
lists the properties each N -gram is checked against.
**Encrypted-field inference. A special case of fields are en-**
crypted fields in partially encrypted protocols. In order to
identify such fields within the rest of a message, we devise
a technique based on the observation that the ciphertext of
an encrypted field appears as a random bitstring, while the
corresponding plaintext exhibits some type of structure.
We use a randomness test by Goubault-Larreq and Olivain [14]. Briefly, the test works by estimating the Shannon entropy HN[MLE] of a sequence of N bytes using the
maximum-likelihood estimator, and comparing it with the
expected value HN of the N -truncated entropy of a uniformly random sequence of the same length. If the distance
is less than a predefined threshold, the sequence of bytes
under examination is judged random (see [14] for details).
Consider a procedure randomness test that performs such
a test, and a message M . For each possible byte offset
_o = 1...n in the message and for each possible field length_
_l, our detector computes randomness test(M_ [o : o + l])
and randomness test(D(M [o : o + l])) (where D is the
decryption function). Whenever the original byte sequence is
random—according to the test—but the decrypted one is not,
the decrypted value is passed to the content- and dependentfield detectors.
**Composite-field inference. Protocol specifications may in-**
clude set semantics [15], i.e. repeated sequences of one or
more field types; different messages of the same type may
carry sequences of different length. Our approach therefore
incorporates a detector—based on the Kolpakov/Kucherov
algorithm (§ III-B1)—to find repetitions in the sequence of
discovered fields. Sequences of different lengths but carrying
the same type(s) are mapped to the same composite field type.
Overall, the field inference steps output a set of messages,
each labeled with a set of discovered fields (e.g. see Fig. 2(b)).
_5) Specification Abstraction: The final step processes all_
messages in each type cluster, isolating the commonalities


-----

|Rule target|Details|
|---|---|
|Length|Detect fields whose value F satisfies F = aL −b for some a, b (L is a field or message length).|
|Offset|Detect fields whose value F satisfies F = aO for some a (O is the offset of a known content field).|
|CRC|Detect fields whose value matches H(M) (M is the message content and H a known hash function).|


TABLE II: Dependent-field detection rules

between sequences of fields in different messages, and abstracting a message-type specification T (recall that such
specification consists of a sequence of field types t1, ..., tm).
This step is necessary to reconcile inconsistencies in the fields
detected in different messages of the same type, which arise
for two reasons. First, message clusters determined in Step C.1
may be imprecise, or a protocol may use different message
formats within the same declared message type. Second, rulebased heuristics may not detect all occurrences of a field.
In order to merge multiple field sequences in a single
specification, we use the Needleman-Wunsch algorithm [7]—a
dynamic programming technique to find the optimal alignment
of similar sequences, often used for message comparison and
clustering (e.g. [15], [16]). A significant issue is that the
coverage of message samples may not be complete, as each
message may include one or more sections in which the
field inference step was not able to detect any field. When
performing alignment, we allow sequences to include blanks
representing one or more unknown field. We treat blanks as
wildcards, allowing them to align with any concrete field type.
For each message type, we choose the message sample
for which the richest set of fields has been inferred, and we
generate a specification based on that. Then, we iteratively
align the field sequence in the specification with the field
sequences in all other messages of the same type. Every time
the two sequences align, we “fuse” them in a single sequence
specification (if a blank aligns with a concrete field type, the
merged specifications only includes the latter). If a cluster
contains messages that do not align with the specification, we
split it and repeat specification abstraction on both the newly
created clusters, recursively splitting them again if necessary.
The final output is the protocol specification (e.g. Fig. 2(c)).

_D. Signature Generation_

One relevant use for malware protocol specifications is the
generation of network signatures. We therefore also propose a
methodology to generate deterministic signatures, which verify
an ordered set of tests, each of which checks the presence
of an expected field. We declare a match if all the (ordered)
tests in a signature are satisfied. The tests are rules that check
particular field characteristics in a given group of contiguous
bytes, defined by an offset and length. We develop tests for
all the content-field detectors in Table I. For example, the
_MATCH STRING test will check for the existence of only_
printable characters in the group. In addition, we developed
tests to check for the dependent fields in Table II. For instance,
_MATCH MSG LENGTH will check that the message length_
is equal to the value of the input byte group.


IV. EVALUATION

We evaluate our approach in depth showing that it can:
(i) correctly identify encryption uses, (ii) effectively infer the
structure of C&C messages, (iii) generate correct protocol
specifications, and (iv) lead to effective network signatures.

_A. Malware Families under Examination_

For our evaluation, we selected three malware families
based on: (i) the level of understanding of the C&C protocol
by the research community, so to have a well-defined ground
truth; and (ii) the popularity of the malware in recent times.
For each family we collected and run one binary sample. The
families are: Sality [17], a polymorphic file infector and downloader with complex features, such as P2P support (we focus
on the Sality.AE variant). Ramnit [18], a malware platform
with various capabilities, provided through a flexible plugin
architecture. Ramnit’s botnet is estimated to have included
3M+ infected machines at his peak [19]. ZeroAccess [6], a
trojan that can act as a backdoor and download additional
malware. It uses P2P to participate in botnets extending to
tens of thousands of infected machines [20].

_B. Infrastructure_

Our sandbox implementation (Step A in Fig 1(a)) is based
on the Cuckoo sandbox [21], extended with a custom analysis
package based on PIN [22] to generate instruction traces[1].
The sandbox output (instruction trace, syscall trace and
network trace) is passed to our encryption analysis tool
(Steps B.1-B.4). The candidate detection and candidate matching steps in our algorithm are based on the publicly available
ALIGOT Python implementation [23], modified to introduce
our optimizations. Separately, the tool analyzes the sandboxgenerated network trace to extract malware messages, and
matches extracted messages with detected encryption uses; if
possible each message is then decrypted (Step B.4). This step
returns a list of message samples annotated with encryption
information (Fig. 2(a)).
A second application implements message clustering, field
inference, and abstraction (Steps C.1-C.5). The output consists
of (i) a list of messages, each augmented with a list of detected
fields (Fig. 2(b)), and (ii) specifications for each detected
message type (Fig. 2(c)). Overall, our toolchain consists of
17500 lines of Python code and 1900 lines of C++.

_C. Additional Datasets_

In order to augment the diversity of our C&C dataset beyond
the messages obtained from the sandbox, we integrated additional message datasets, detailed in Table III (user-generated
traces were anonymized). To extract C&C messages from the
network traces, we used a combination of Snort [24], custom
scripts and manual inspection. We emphasize that message
extraction was a non-trivial effort, spawning several weeks of
analyst’s time. Benign traffic samples used for our signature
matching evaluation (§ IV-H) are also detailed in Table III.

1We used industry-standard tools for VM tuning and hardening (e.g. pafish)
to circumvent malware anti-VM techniques.


-----

**ORIGIN: Ramnit** **ORIGIN: Ramnit** **MESSAGE_TYPE: "E2-CS-0"**
**L4 PROTOCOL: TCP** **PAYLOAD: 00FF4B000000E20020000(...)** **FIELD #1:** _type="MAGIC" length="2" content="00FF"_
**SOURCE IP: 192.168.1.1** **FIELDS INFORMATION:** **FIELD #2:** _type="MSG_LENGTH" length="4"\_
**SOURCE PORT: 49197** **FIELD #1:** _type="MAGIC" position="0-1"_ _content="4B000000"_
**DESTINATION IP: 192.168.1.2** **FIELD #2:** _type="MSG_LENGTH" position="2-5"_ **FIELD #3:** _type="MSG_TYPE" length="1" content="E2"_
**DESTINATION PORT: 443** **FIELD #3:** _type="MSG_TYPE" position="6"_ **FIELD #4:** _type="BLANK" length="1" content="00"_
**DIRECTION: Client → Server** **FIELD** **#4: type="FIELD_LENGTH"\** **FIELD #5:** _type="FIELD_LENGTH" refers-to="6"\_
**TIMESTAMP: 1435260024.859** _position="8-11" refers-to="5"_ _length="4" content="20000000"_
**PAYLOAD: 00FF4B000000E20020000(...)** **FIELD** **#5:** _type="ENCRYPTED_RANGE"\_ **FIELD #6:** _type="ENCRYPTED_RANGE"length="32"\_
**ENCRYPTION INFORMATION:** _position="12-43"_ _content_type="variable"_
**USE #1:** _cipher="RC4" key="626C61636B"\_ **FIELD #5.1:** _type="STRING"\_ **FIELD #6.1:** _type="STRING" length="32"\_
_length="32" offset="49"\_ _position="12-43"_ _content_type="variable"_
_plaintext="353437353563(...)"_ **FIELD #6: type="FIELD_LENGTH"\** **FIELD #7:** _type="BLANK" length="1" content="00"_
**USE #2:** _cipher="RC4" key="626C61636B"\_ _Position ="45-48" refers-to="7"_ **FIELD #8:** _type="FIELD_LENGTH" refers-to="9"\_
_length="32" offset="12"\_ **FIELD #7:** _type="ENCRYPTED_RANGE"\_ _length="4" content="20000000”_
_plaintext="306263356665(...)"_ _position="49-80"_ **FIELD #9:** _type="ENCRYPTED_RANGE" length="32"\_

**FIELD #7.1: type="STRING"\** _content_type="variable"_
_position="49-80"_ **FIELD #9.1:** _type="STRING" length="32"\_
**MESSAGE TYPE: E2** _content_type="variable"_

**(a) Output of decryption (step B.4)** **(b) Output of field inference (C.1-C.4)** **(c) Output of specification abstraction (C.5)**

Fig. 2: Examples of output of various algorithmic steps for the Ramnit C&C protocol

|Traces used to extract C&C samples|Col2|Col3|
|---|---|---|
|Trace|Size|Type|
|ISP|635GB (330M flows)|User-generated traffic from large European ISP|
|MTA|386MB (17K flows)|Malware-generated traffic from controlled environment [25]|
|Traces containing benign traffic|||
|DARPA|3.4GB (858K flows)|DARPA corpus [26] (week #2)|
|M57|4.6GB (93K flows)|Corpus for forensics training [27]|


TABLE III: External traces used in the evaluation

_D. Encryption Analysis_


In this section, we evaluate how effective is our encryption
analysis algorithm in: (i) detecting the use of encryption, and
(ii) inferring details of the encryption used in C&C protocols.
In order to evaluate the detection part, we define an encryp_tion instance as a contiguous block of message bytes encrypted_
with the same key. We ran the malware binaries, collected
instruction traces and fed them to our analysis infrastructure
(Steps B.1-B.3). We then compared the number of detected
instances to the true number determined via manual analysis.
Table IV, col. 2 shows that our approach detects all encryption
instances for all malware families. As repeated experiments led
to the same outcome, we present the result of a sample run.
We then fed the detected encrypted instances and the
sandbox-generated network traces to our encryption inference
algorithm (Step B.4). In all cases, the tool mapped instances
to C&C messages, and correctly inferred encryption details
of interest (see Table IV col. 3-5). Overall, results in this
section suggest that our approach is effective in detecting and
assessing uses of encryption in C&C protocol messages.

_E. Message Clustering/Inference_

This part of our evaluation investigates the effectiveness of
our approach in: (i) clustering C&C messages by type, and
(ii) detecting various field types within messages. In this step,
we integrated the set of malware messages obtained from
the sandbox with samples from the traces in Table III. The
aggregated dataset consists of 20615 Sality messages, 1237
Ramnit messages and 49378 ZeroAccess messages.
**Clustering: This step detects message-type fields and parti-**
tions the message set based on these fields. For Sality, our


trace only includes client-to-server messages, which prevents
the message-type heuristic (§ III-C1) from being applied. All
messages are therefore mapped to a single type, which is
correct, as all messages are of the pack exchange variety (used
by peers to spread IPs of other bots). In the case of Ramnit and
ZeroAccess full conversations are available. In both cases the
heuristics isolates the relevant message-type bytes, identifying
_23 of 29 true message types for Ramnit, and 4 out of 4 true_
_message types for ZeroAccess. The discrepancy in Ramnit_
message types is due to messages with the same declared type
but different structure, perhaps reflecting protocol evolution.
**Field inference: This step analyzes messages in each type**
cluster. Rule-based detectors are used to identify various types
of field (ref. Tables I and II). We define two metrics to assess
the effectiveness of field inference: for a set of samples of a
given type, field coverage is the fraction of correctly-inferred
fields, while byte coverage is the fraction of bytes within
correctly inferred fields. We consider a field to be correctly
inferred if its type/position align with that of a true field.
We note that the C&C protocols evaluated include several
field types not supported by our approach. This is because
we only support fields that map to generic data types (e.g.
string, IPv4, CRC) and ignore malware-specific data types
(e.g. ZeroAccess messages carry binary values expressing the
age in seconds of peer IP addresses).
Fig. 3(a) presents field and byte coverage of the set of
supported fields for each message type (i.e. the baseline
only includes fields whose type is supported by our detectors). For Ramnit we only present aggregate results—
average/min/max—as this malware defines 29 distinct types.
Type names are taken from protocol specifications.
In general, our approach detects the majority of supported
fields/bytes. The limited coverage of certain ZeroAccess message types is due to timestamps which predate their respective
messages by more than one week, causing the timestamp
detector to ignore them. Sality’s undetected content consists
of message-type fields, that cannot be identified since only
client-side messages are available. Note that that most message
types contain < 10 fields, therefore even a limited number
of undetected fields can greatly affect the result. Overall we
achieve 97% supported-field coverage (amounting to 97% of


-----

```
1.0
0.8

```
```
1.0
0.8

```
```
0.6
0.4

```
```
0.6
0.4

```
```
0.2
0.0

```
|Col1|Col2|Col3|Col4|Col5|Col6|Col7|Col8|Col9|Col10|Col11|Col12|Col13|Field Byte|Col15|Col16|Covera coverag|Col18|Col19|ge e|Col21|Col22|Col23|Col24|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|||||||||||||||||||||||||
|||||||||||||||||||||||||
|||||||||||||||||||||||||
|||||||||||||||||||||||||
|||||||||||||||||||||||||
|||||||||||||||||||||||||
|||||||||||||||||||||||||

|Col1|Col2|Col3|Col4|Col5|Col6|Col7|Col8|Col9|Col10|Col11|Col12|Col13|Field Byte|Col15|Col16|Covera coverag|Col18|Col19|ge e|Col21|Col22|Col23|Col24|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|||||||||||||||||||||||||
|||||||||||||||||||||||||
|||||||||||||||||||||||||
|||||||||||||||||||||||||
|||||||||||||||||||||||||
|||||||||||||||||||||||||
|||||||||||||||||||||||||
|||||||||||||||||||||||||
|||||||||||||||||||||||||

```
Sality
 URLs

```
```
Sality
 URLs

```
```
 ZA
getF

```
```
 ZA
setF

```
```
 ZA
retL

```
```
 ZA
getF

```
```
 ZA
setF

```
```
 ZA
retL

```
```
 ZA
getL

```
```
 ZA
getL

```
```
 ZA
srv?

```
```
 ZA
yes!

```
```
0.2
0.0

```
```
 ZA
srv?

```
```
 ZA
yes!

```
```
Ramnit
(all)

```
```
Ramnit
(all)

```

(a) Coverage of supported field types (b) Overall coverage

Fig. 3: Coverage of supported field types and of full message content for each message type

|Malware|Instances found/present|Key material|Cipher|Encryption type|
|---|---|---|---|---|
|Sality|64/64|Carried in message|RC4|full message|
|Ramnit|32/32|Constant|RC4|partial(some fields only)|
|ZeroAccess|120/120|Constant|RC4|full message|


TABLE IV: Details of the encryption used by each malware

bytes in those fields), with null or negligible false positive rate.
Fig. 3(b) presents overall coverage of all message content,
including fields for which no detector is available. The large
majority of undetected fields/bytes is taken up by ZeroAccess’
peer-age fields described above. Such semantics are malwarespecific and cannot be captured by generic rules.
In summary, results show that our approach can identify a
significant fraction of the structures carried by C&C messages.

_F. Specification Abstraction_


2.50 GHz processors and 64 GB of RAM. We do not present
sandbox execution times as they depend on the malware
under analysis (e.g. our Sality binary waits 10 minutes before
generating traffic). We emphasize that although our current
implementation is single-threaded, each step presents abundant
opportunities for parallelism. For example, analysis of distinct
messages could be largely performed in parallel.

_H. Signature Generation and Matching_


In order to evaluate the correctness of auto-generated specifications we compare them to manually-generated specifications based on domain knowledge and analyses of the C&C
protocols ( [17], [28], [29]). Results are presented in Table V.
Column #3 in the table shows the number of inferred types
in each case. In most cases, there is a 1:1 mapping between
inferred and actual types; the exception is the ZeroAccess retL
message type, for which our algorithm abstracts 2 different
message sub-types. Messages of this type carry sequences
of (Peer IP address, Peer age) pairs. The composite-field
heuristics detect a sequence of fields only if it includes three
or more elements; hence, sequences of length 2 are mapped
to a dedicated field type.
Column #4 lists the number of supported field types in
the specification of each message type. Columns #5 describes
how many supported fields were correctly recognized for each
message type. Column #6 details false positives: fields that
were inferred but are not present in the protocol specification.
Overall, our approach produces informative specifications,
consistent with the structure of the protocol under analysis.

_G. Performance_


For each of the C&C message sets, we generated (using
the methodology described in § III-D) and tested signatures
using 5-fold cross-validation (4 partitions for training and 1
for testing in every iteration). The signature generation and
matching logic is implemented as a set of PERL scripts,
totaling 700 lines of code. This code receives as input (i)
protocols specifications generated by our toolchain, and (ii) a
set of messages extracted from network traffic. Each message
is then either matched to a malware protocol specification, or
discarded as benign if no match is found.
Table VII shows, for the three malware families, the average
percentage of messages of a given family that matches the
signatures. We also show a column for Others, which include
all the traffic from the malware-free traces DARPA and M57.
These traces consist of a mix of popular protocols (DNS,
HTTP) and unknown UDP and TCP traffic. Overall, our results
show no false positives and very low false negatives.

V. DISCUSSION


Table VI details the performance of all steps of our algorithm when running on a Linux Server with Intel Xeon


**Per-session encryption and TLS: In order to decrypt mes-**
sage samples from external sources, our approach requires that
encryption key and/or the decryption approach inferred from
sandbox-originated traces, are valid for any protocol message.
It is in principle possible to design protocols that generate permessage keys which are non-trivial to infer, or use asymmetric
encryption. One case of particular interest within this category
is that of protocols using TLS. First, we note that as long as the
cipher can be detected, our approach can be extended to extract
the plaintext for messages generated/received by the sandbox.
However, since each connection dynamically negotiates unique
keys, it is no longer possible to infer a botnet-wide encryption
key. Decryption may still be achieved via protocol-specific
approaches; e.g. decrypt TLS connections via TLS proxying.


-----

|Malware|Msg types (ground truth)|#Msg types (inferred)|#Supported fields (ground truth)|#Supported fields (correctly inferred)|#Supported fields (false positives)|
|---|---|---|---|---|---|
|Sality|URLs|1|4|3|0|
|Ramnit|29 types|28|168|124|6|
|ZeroAccess|getF|1|5|4|0|
||setF|1|5|4|0|
||srv?|1|10|5|0|
||yes!|1|10|4|0|
||getL|1|5|4|0|
||retL|2|266|262|2|


TABLE V: Breakdown of inferred specifications


In our work, we observed that malware commonly tends
to use simple forms of symmetric encryption, and to reuse
keys. For example, our C&C dataset for the popular Ramnit
malware shows that the same key was used in 2012 and 2014.
**Text-based protocols: Our approach focuses on binary-**
only protocols due to their popularity, however it is possible
for malware to use text-based protocols. Although supporting
them is outside the scope of this work, we note that the
encryption analysis component can still be applied.
**Session semantics: Our current analysis does not inves-**
tigate how multiple messages are organized into sessions.
Although inferring the protocol state machine (such as in [30])
is outside the scope of our work, our approach could be easily
extended to infer various types of session-related information.
In particular, our approach partitions each malware connection
into an ordered sequence of client and server messages,
enabling inference of the conversation structure (e.g. Client
_messages of type A are always followed by server messages of_
_type B). It would be similarly possible to detect session keys—_
fields that are common across requests and responses—e.g.
using the heuristics from [13]. We leave this as future work.

VI. RELATED WORK

**Protocol reverse engineering: Techniques proposed for pro-**
tocol reverse-engineering can be divided into three groups,
depending on their input data. The first one generates protocol specifications from network traces. For instance, ScriptGen [31] generates a state machine from the network traffic
observed, in order to produce scripts that approximate responses to different protocol requests, while [15] automatically
infers message formats using a few very generic pre-defined
field semantics. None of these approaches targets binary
protocols; furthermore, we infer detailed specifications which
include rich field types, and we deal with encryption while
previous works do not.
The second group of works in this space includes techniques that analyze execution traces captured as a program is
communicating over the network. [32] presents a technique
to generate specifications by instrumenting a program during
message processing. The approach focuses on server-side
code, which in the malware world is notoriously difficult to
access. Dispatcher [33] does automatic reverse-engineering
of C&C protocols by performing dynamic analysis on the
malware binary. By tainting the incoming data from the
network, the authors can derive detailed semantics of fields


after identifying the prototype of the function in which the
data will be used. For outgoing data, the authors employ
a series of heuristics to determine the position, size and
semantics of fields. The authors also consider protocols that
use encryption, by leveraging existing techniques [34], where
plain-text data is extracted from the input (output) buffers to
encryption (decryption) functions. We differ from Dispatcher
in two important ways: (i) we can learn and generalize
encryption details, enabling analysis of samples not generated
in the sandbox, and signature generation and online traffic
decryption; (ii) we only require lightweight, passive binary
analysis which can be avoided if a malware from the same
family has been analyzed in the past.
The third group of works follows a hybrid approach.
Prospex [35] analyzes both network traffic and execution traces
to infer the malware’s protocol state machine. However, [35]
does not handle encrypted traffic.
**Botnet C&C traffic detection: Several approaches in litera-**
ture focus on generating unique patterns or signatures to identify C&C protocol communications. ProVex [3] is probably
closest to our work. It heuristically attempts to decrypt packets
using known encryption algorithms and extracts a statistical
profile of the byte distribution in the payload. Differently from
our approach, it requires previous knowledge of encryption
keys. In addition, its signatures are probabilistic and thus
potentially prone to higher false positives, while we can build
very detailed signatures based on field semantics. Botzilla [36]
works similarly but does not perform decryption, assuming
instead that recurring textual features can be found even within
encrypted messages. CoCoSpot [37] also targets C&C protocols and deals with encryption traffic by using features that are
orthogonal to payload obfuscation, such as message lengths.
FIRMA [38] generates malware signatures that primarily target HTTP-based C&C communication. Similarly, Perdisci et
al. [39] uses URL-based features to cluster malware-derived
HTTP samples and generate effective signatures (as opposed
to our approach, which focuses on binary messages). Both

[38], [39] rely on deriving sets of constant tokens via string
analysis, which is not possible in the presence of encryption.
**C&C Server Fingerprinting: These kinds of works try to**
identify C&C servers by analyzing their response to carefully
constructed probes. Two recent works in this area are [40],

[41]. These works aim at generating probes that elicit a
meaningful answer from the server, while we aim at inferring
full specifications for message formats to enable detection.


-----

|Malware|#Instructions in program trace|Trace analysis|#Msg samples in dataset|Avg. #fields per message|Enc. inference/ Msg decryption|Clustering|Field inference|Specification abstraction|
|---|---|---|---|---|---|---|---|---|
|Sality|67M|103min|20615|6|2s|2s|8s|8s|
|Ramnit|13M|16min|1237|7|<1s|<1s|5min|<1s|
|ZeroAccess|19M|20min|49378|131|3s|44s|73min|36s|


TABLE VI: Dataset sizes and execution times of algorithmic steps

|Malware|Average % of Messages Detected per Signature Sality Ramnit ZeroAccess Others|Col3|Col4|Col5|
|---|---|---|---|---|
|Sality|100|0|0|0|
|Ramnit|0|96.4|0|0|
|ZeroAccess|0|0|99.9|0|


TABLE VII: Traffic classification results

VII. CONCLUSION

In this paper we present a solution to infer the format of
malware binary C&C protocols. Understanding such protocols
is crucial to gain insight into how malware works, and to build
effective network-based malware detectors. Our inference algorithm produces detailed protocol specifications, including
rich content types—significantly alleviating the cumbersome
and error-prone task of understanding malware communications manually. Our approach works in the presence of
encryption, using program analysis to recover encryption keys.
Evaluation results show that our approach is effective in decrypting malware samples and infers rich specifications, which
generate effective network signatures for C&C protocols.

REFERENCES

[1] “Internet Security Threat Report, Volume 20,” Symantec, Tech. Rep.,
Apr. 2015.

[2] “Safeguarding The Internet: Level 3 Botnet Research Report,” Level 3
Communications, Tech. Rep., Jun. 2015.

[3] C. Rossow and C. J. Dietrich, “Provex: Detecting botnets with
encrypted command and control channels,” in DIMVA, 2013.

[4] J. C. Mitchell, Foundations of Programming Languages. Cambridge,
MA, USA: MIT Press, 1996.

[5] J. Calvet, J. M. Fernandez, and J.-Y. Marion, “Aligot: Cryptographic
Function Identification in Obfuscated Binary Programs,” in CCS, 2012.

[6] J. Wyke, “The ZeroAccess Botnet - Mining and Fraud for Massive
Financial Gain,” Sophos, Tech. Rep., Sep. 2012.

[7] S. B. Needleman and C. D. Wunsch, “A general method applicable to
the search for similarities in the amino acid sequence of two proteins,”
_Journal of Molecular Biology, vol. 48, no. 3, pp. 443–453, Mar. 1970._

[8] R. Zhao, D. Gu, J. Li, and R. Yu, “Detection and analysis of cryptographic data inside software,” in ISC, 2011.

[9] F. Gr¨obert, C. Willems, and T. Holz, “Automated Identification of
Cryptographic Primitives in Binary Programs.” in RAID, 2011.

[10] M. Lothaire, Applied combinatorics on words. Cambridge University
Press, 2005.

[11] R. Kolpakov and G. Kucherov, “Finding maximal repetitions in a word
in linear time,” in FOCS, 1999.

[12] P. Lestringant, F. Guih´ery, and P.-A. Fouque, “Automated identification
of cryptographic primitives in binary code with data flow graph isomorphism,” in ASIA CCS, 2015.

[13] I. Bermudez, A. Tongaonkar, M. Iliofotou, M. Mellia, and M. M.
Munafo, “Automatic Protocol Field Inference for Deeper Protocol
Understanding,” in IFIP, 2015.

[14] J. Goubault-Larrecq and J. Olivain, “Detecting Subverted Cryptographic
Protocols by Entropy Checking,” ENS-Cachan, Tech. Rep. LSV-06-13,
2006.

[15] W. Cui, J. Kannan, and H. J. Wang, “Discoverer: Automatic protocol
reverse engineering from network traces,” in USENIX, 2007.

[16] M. A. Beddoe, “Network protocol analysis using bioinformatics
algorithms,” Tech. Rep., 2004.


[17] N. Falliere, “Sality: Story of a peer-to-peer viral network,” Symantec,
Tech. Rep., Jul. 2011.

[18] Symantec Security Response, “W.32 Ramnit Analysis,” Symantec,
Tech. Rep., Feb. 2015.

[19] “Europol cracks down on botnet _|_ Ars Technica.” [Online].
Available: http://arstechnica.com/tech-policy/2015/02/europol-cracksdown-on-botnet-infecting-3-2-million-computers/

[20] “ZeroAccess botnet resumes click-fraud activity after sixmonth break _|_ Dell SecureWorks Blog.” [Online]. Available: http://www.secureworks.com/resources/blog/zeroaccess-botnetresumes-click-fraud-activity-after-six-month-break/

[21] “Cuckoo Sandbox.” [Online]. Available: http://cuckoosandbox.org/

[22] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney,
S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: Building Customized
Program Analysis Tools with Dynamic Instrumentation,” in PLDI,
2005.

[23] “Aligot source code,” May 2016. [Online]. Available: https://code.
google.com/archive/p/aligot/

[24] “Snort IDS,” May 2016. [Online]. Available: http://www.snort.org/

[25] “Malware-Traffic-Analysis.net.” [Online]. Available: http://malwaretraffic-analysis.net/about.html

[26] R. Lippmann, J. W. Haines, D. J. Fried, J. Korba, and K. Das,
“Analysis and results of the 1999 DARPA off-line intrusion detection
evaluation,” in RAID, 2000.

[27] “M57-Patents Scenario,” May 2016. [Online]. Available: http://
digitalcorpora.org/corpora/scenarios/m57-patents-scenario

[28] C. Chen, “Virus Bulletin: Ramnit bot,” Nov. 2012. [Online]. Available: https://www.virusbtn.com/virusbulletin/archive/2012/
11/vb201211-Ramnit.dkb

[29] K. McNamee, “Malware analysis report - botnet: ZeroAccess/Sirefef,”
Kindsight, Tech. Rep., Jul. 2014.

[30] C. Y. Cho, E. C. R. Shin, D. Song, and others, “Inference and analysis
of formal models of botnet command and control protocols,” in CCS,
2010.

[31] C. Leita, K. Mermoud, and M. Dacier, “ScriptGen: an automated script
generation tool for Honeyd,” in ACSAC, 2005.

[32] G. Wondracek, P. M. Comparetti, C. Kruegel, and E. Kirda, “Automatic
Network Protocol Analysis,” in NDSS, 2008.

[33] J. Caballero, P. Poosankam, C. Kreibich, and D. Song, “Dispatcher:
Enabling active botnet infiltration using automatic protocol reverseengineering,” in CCS, 2009.

[34] Z. Wang, X. Jiang, W. Cui, X. Wang, and M. Grace, “ReFormat:
Automatic reverse engineering of encrypted messages,” in ESORICS,
2009.

[35] P. M. Comparetti, G. Wondracek, C. Kruegel, and E. Kirda, “Prospex:
Protocol Specification Extraction,” in IEEE S&P, 2009.

[36] K. Rieck, G. Schwenk, T. Limmer, T. Holz, and P. Laskov, “Botzilla:
detecting the phoning home of malicious software,” in SAC, 2010.

[37] C. J. Dietrich, C. Rossow, and N. Pohlmann, “CoCoSpot: Clustering
and recognizing botnet command and control channels using traffic
analysis,” Computer Networks, vol. 57, no. 2, pp. 475–486, Feb. 2013.

[38] M. Z. Rafique and J. Caballero, “Firma: Malware clustering and
network signature generation with mixed network behaviors,” in RAID,
2013.

[39] R. Perdisci, W. Lee, and N. Feamster, “Behavioral Clustering of
HTTP-Based Malware and Signature Generation Using Malicious
Network Traces.” in NSDI, 2010.

[40] A. Nappa, Z. Xu, J. Caballero, and G. Gu, “CyberProbe: Towards
Internet-Scale Active Detection of Malicious Servers,” in NDSS, 2014.

[41] Z. Xu, A. Nappa, R. Baykov, G. Yang, J. Caballero, and G. Gu,
“Autoprobe: Towards automatic active malicious server probing using
dynamic binary analysis,” in CCS, 2014.


-----