# The Tangled Genealogy of IoT Malware


## Emanuele Cozzi
#### emanuele.cozzi@eurecom.fr EURECOM Sophia Antipolis, France

## Yun Shen
#### yun.shen@nortonlifelock.com NortonLifeLock, Inc. Reading, United Kingdom

## Pierre-Antoine Vervier
#### France

## Matteo Dell’Amico
#### della@linux.it France

## Leyla Bilge
#### leylya.yumer@nortonlifelock.com NortonLifeLock, Inc. Sophia Antipolis, France

## Davide Balzarotti
#### davide.balzarotti@eurecom.fr EURECOM Sophia Antipolis, France

### ABSTRACT

The recent emergence of consumer off-the-shelf embedded (IoT)
devices and the rise of large-scale IoT botnets have dramatically increased the volume and sophistication of Linux malware observed
in the wild. The security community has put considerable effort into documenting these threats, but analysts mostly rely on manual work, which
is difficult to scale and hard to maintain regularly. Moreover,
the vast amount of code reuse that characterizes IoT malware calls
for an automated approach to detect similarities and identify the
phylogenetic tree of each family.
In this paper we present the largest measurement of IoT malware
to date. We systematically reconstruct, through the use of binary
code similarity, the lineage of IoT malware families, and track
their relationships, evolution, and variants. We apply our technique
to a dataset of more than 93K samples submitted to VirusTotal over
a period of 3.5 years. We discuss the findings of our analysis and
present several case studies to highlight the tangled relationships
of IoT malware.

### CCS CONCEPTS

- Security and privacy → Software and application security; Malware and its mitigation.

### KEYWORDS

Malware, IoT, Classification, Measurement, Lineage

**ACM Reference Format:**
Emanuele Cozzi, Pierre-Antoine Vervier, Matteo Dell’Amico, Yun Shen,
Leyla Bilge, and Davide Balzarotti. 2020. The Tangled Genealogy of IoT
Malware. In _Annual Computer Security Applications Conference (ACSAC
2020), December 7–11, 2020, Austin, USA._ ACM, New York, NY, USA, 16 pages.
[https://doi.org/10.1145/3427228.3427256](https://doi.org/10.1145/3427228.3427256)

_ACSAC 2020, December 7–11, 2020, Austin, USA_
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8858-0/20/12...$15.00
[https://doi.org/10.1145/3427228.3427256](https://doi.org/10.1145/3427228.3427256)


### 1 INTRODUCTION

Over the last few years we have witnessed an increase in both the
volume and sophistication of malware targeting IoT systems. Traditional botnets and DDoS tools now cohabit with crypto-mining,
spyware, ransomware, and targeted samples designed to conduct
cyber espionage. To make things worse, the public availability of
the source code associated with some of the main IoT malware
families has paved the way for myriads of variants and tangled
relationships of similarity and code reuse. To make sense of this
complex evolution, the security community has devoted considerable effort to analyzing and documenting these emerging threats, mostly
through blog posts and the definition of Indicators of
Compromise [7, 8, 36, 44]. However, while the insights gained from
these reports are invaluable, they provide a very scattered view of
the IoT malware ecosystem.
On the academic side, Cozzi et al. [12] provided the first large-scale study of Linux malware by relying on a combination of static
and dynamic analyses. The authors studied the behavior of 10K
samples collected between November 2016 and November 2017,
with the goal of documenting the sophistication of IoT malware
(in terms of persistence mechanisms, anti-analysis tricks, packing,
etc.). Antonakakis et al. [3] instead dissected the Mirai botnet and
provided a thorough investigation into its operations, while Pa et
al. [35] and Vervier et al. [45] used IoT honeypots to measure the
infection and monetization mechanisms of IoT malware.
Despite these efforts, little is known about the dynamics behind
the emergence of new malware strains, and today IoT malware is
still classified based on the labels assigned by AV vendors. Unfortunately, these labels are often very coarse-grained, and therefore
unable to capture the continuous evolution and code sharing that
characterize IoT malware. For instance, it is still unclear how many
variants of the Mirai botnet have been observed in the wild, and
what makes each group different from the others. We also have a
poor understanding of the inner relationships that link together
popular families, such as the Mirai and Gafgyt botnets and the
infamous VPNFilter malware.
This paper aims at filling this gap by proposing a systematic
way to compare IoT malware samples and display their evolution
in a set of easy-to-understand lineage graphs. While there exists a
large corpus of work on the clustering of traditional
malware [5, 6, 25, 29, 38] and on the exploration of its lineage [15, 24, 26, 28,
31, 33], which attests to the complexity of these problems, in this paper we
show that the peculiarities of IoT malware require the adoption of
customized techniques. On the other hand, to our advantage, the
current number of samples and the general lack of code obfuscation
make it possible, for the first time, to draw a complete picture that
covers the entire ecosystem of IoT malware.
Our main contribution is twofold. First, we present an approach
to reconstruct the lineage of IoT malware families and track their
evolution. This technique allows us to identify the variants of
each family as well as the intra-family relationships that arise from
code reuse among them. Second, we report on the insights
gained by applying our approach to the largest dataset of IoT malware assembled to date, which includes all malicious samples
collected by VirusTotal between January 2015 and August 2018¹.
Our lineage graphs enabled us to quickly discover over a hundred
mislabeled samples and to assign the proper name to those for
which AV products did not reach a consensus. Overall, we identified
and validated over 200 variants in the top families alone, showed
the speed at which new variants were released, and measured
for how long new samples belonging to each variant appeared on
VirusTotal. By looking at changes in the functions, we also identified
a constant evolution of thousands of small variations within each
malware variant. Finally, our experiments also emphasize how the
frequent code reuse and the tangled relationships among all IoT
families complicate the problems of assigning a name to a given
sample and of clearly separating the end of one family from the beginning
of another.
We make the full dataset and the raw results available to researchers². We also share high-resolution figures of the lineage
graphs, generated per architecture, for ease of exploration.

### 1.1 Why this Study Matters

IoT malware is an important emerging phenomenon [35], not just
because of its recent development but also because IoT devices
might not be able to run anti-malware solutions comparable to
those we use today to protect desktop computers. However, to
be able to design new solutions, it is important for the security
community to precisely understand the characteristics of the current threat landscape. This need prompted researchers to conduct
several measurement studies, focused for instance on the impact
of the Mirai botnet [3] or on the techniques used by Linux-based
malicious samples [12].
This work follows the same direction, but it is over one order of
magnitude larger than previous studies and includes all malicious
samples submitted to VirusTotal over a period of 3.5 years. A consequence of the scale of the measurement is that the manual analysis
used in previous studies had to be replaced with fully automated
comparison and clustering techniques.
Our findings are not just curiosities, but carry important consequences for future research in this field. For example, static analysis
was the preferred choice for program analysis, until researchers
showed that the widespread use of packing and obfuscation made it
unsuitable in the malware domain [34]. Our work shows that this is
not yet the case in the IoT space, and that today static code analysis
provides more accurate results than looking at dynamic sandbox
reports or static features. The fragmentation of IoT families also
casts some doubts on the ability of AV labels to characterize the
complex and tangled evolution of IoT samples.
Finally, while not our main contribution, our work also reports
on the largest clustering experiments conducted to date on dynamic
features extracted from malicious samples [5, 6, 25].

¹ As explained in Section 2, we included in our analysis only samples detected as
malicious by at least five AV systems.

² Dataset and figures: https://github.com/eurecom-s3/tangled_iot/

### 2 DATASET

To study the genealogy of IoT malware, our first goal was to collect
a large and representative dataset of malware samples. For this
purpose, we downloaded all ELF binaries that had been submitted
to VirusTotal [2] over a period of almost four years (from January
2015 to August 2018) and that had been flagged as malicious by at
least five anti-virus (AV) vendors. Since our goal is to analyze malware that targets IoT devices, we purposely discarded all Android
applications and shared libraries. Furthermore, we also removed
samples compiled for the Intel and AMD architectures because it is
very difficult to distinguish the binaries for embedded devices from
the binaries for Linux desktop systems. These selection criteria resulted in a dataset of 93,652 samples, one order of magnitude larger
than any other study conducted on Linux-based malware. As a comparison, the largest measurement study to date was performed on
10,548 Linux binaries [12], of which a considerable fraction (64.56%)
were malware targeting x86 desktop computers. Moreover, the purpose of that dataset was to study the general behavior of modern
Linux malware and not the tangled relationships between samples.
We could have easily extended our dataset to Linux malware
for desktops and servers. However, we preferred to focus
specifically on IoT malware, given its high infection rate on real
devices and the variety of the underlying hardware architectures.
This variety may require platform-specific customizations implemented as
ad-hoc malware variants. Moreover, lesser-known architectures are
more likely to expose details that tend to be overlooked on
more familiar and extensively studied counterparts such as x86.
Figure 1 shows the volume of samples in our dataset submitted to
VirusTotal over the data collection period and the dramatic increase
in the number of IoT malware samples after the outbreak of the
infamous Mirai botnet in October 2016. Before that, the number
of malicious IoT binaries was very low. For instance, only 363
of our 93K samples were observed in that period. This number
progressively increased to reach an average of 7.8K new malicious
binaries per month in 2018. This trend can be attributed to several
factors, including the evolving IoT threat landscape [27, 42, 43, 45],
the source code availability of several popular families [27], and the
proliferation of IoT honeypots that allowed researchers to rapidly
collect a large number of samples spreading in the wild [45].
Table 1 reports the compilation details of the samples in our
dataset. The first two architectures, ARM 32-bit and MIPS I, together account
for two thirds of all samples. This can be explained by
the popularity of these processor architectures in the
consumer IoT devices commonly targeted by this malware, such
as home routers, IP cameras, printers, and NAS devices. Another
interesting aspect is the fact that almost 95% of the ELF files in
our dataset were statically linked. Additionally, as already noted
by Cozzi et al. [12], a large fraction of them (roughly 50% in our
dataset) have not been stripped of their symbols.


**Table 1: Breakdown of samples per architecture.**

| CPU Architecture | Samples No. (%) | Dynamically Linked: Stripped | Dynamically Linked: Unstripped | Dynamically Linked: Total | Statically Linked: Stripped | Statically Linked: Unstripped | Statically Linked: Total |
|---|---|---|---|---|---|---|---|
| ARM 32-bit | 36,574 (39.05) | 3,012 | 645 | 3,657 | 16,049 | 16,868 | 32,917 |
| MIPS I | 25,201 (26.91) | 325 | 345 | 670 | 12,714 | 11,817 | 24,531 |
| PowerPC 32-bit | 10,916 (11.66) | 100 | 258 | 358 | 5,180 | 5,378 | 10,558 |
| SPARC | 8,412 (8.98) | 100 | 119 | 219 | 3,489 | 4,704 | 8,193 |
| Hitachi SH | 6,477 (6.92) | 63 | 107 | 170 | 2,190 | 4,117 | 6,307 |
| Motorola 68000 | 5,982 (6.39) | 52 | 82 | 134 | 2,130 | 3,718 | 5,848 |
| Tilera TILE-Gx | 27 (0.03) | 0 | 1 | 1 | 26 | 0 | 26 |
| ARC International ARCompact | 27 (0.03) | 16 | 2 | 18 | 7 | 2 | 9 |
| Interim Value tba | 9 (0.01) | 2 | 7 | 9 | 0 | 0 | 0 |
| SPARC Version 9 | 8 (0.01) | 1 | 6 | 7 | 1 | 0 | 1 |
| PowerPC 64-bit | 6 (0.01) | 1 | 4 | 5 | 1 | 0 | 1 |
| _Others_ | 13 (0.01) | 4 | 8 | 12 | 1 | 0 | 1 |
| **Total** | 93,652 | 3,676 | 1,584 | 5,260 | 41,788 | 46,604 | 88,392 |

**Figure 1: Number of samples in our dataset submitted to VirusTotal over time.** (Plot not reproduced; x-axis: date, January 2015 to January 2018; y-axis: logarithmic scale.)

**Table 2: Breakdown of the top 10 IoT malware families in our dataset.**

| Rank | Label (AVClass) | Samples No. (%) |
|---|---|---|
| 1 | Gafgyt | 46,844 (50.02) |
| 2 | Mirai | 33,480 (35.75) |
| 3 | Tsunami | 3,364 (3.59) |
| 4 | Dnsamp | 2,235 (2.39) |
| 5 | Hajime | 1,685 (1.80) |
| 6 | Ddostf | 840 (0.90) |
| 7 | Lightaidra | 360 (0.38) |
| 8 | Pnscan | 212 (0.23) |
| 9 | Skeeyah | 178 (0.19) |
| 10 | VPNFilter | 135 (0.14) |
| | **Total** | 89,935 (96.03) |
| | **Unlabelled** | 3,717 (3.97) |

In addition to downloading the binaries, we also retrieved the
VirusTotal reports. We then processed them with AVClass [41], a
state-of-the-art technique that relies on the consensus among the
AV vendors to determine the most likely family name attributed
to malware samples. Table 2 lists the top ten AVClass labels, with
Gafgyt and Mirai largely dominating the dataset. However, there is
a long tail of families (90 in total) that contain only a small number
of samples. Finally, it is interesting to note that AVClass was unable
to find a consensus on a common family name for only 3.7K samples.
While this might seem very small (especially compared with figures
obtained on Windows malware), if we remove Mirai and Gafgyt, a
common label was not found for one third of the remaining samples.

### 3 MALWARE LINEAGE GRAPH EXTRACTION

The field that studies the evolution of malware families and the way
malware authors reuse code between families as well as between
variants of the same family is known as malware lineage. Deriving
an accurate lineage is a difficult task, which in previous studies
has often been performed with help from manual analysis and
on a limited number of samples [24, 31]. However, given
the scale of our dataset, we need to rely on a fully automated
solution. The traditional approach for this purpose is to perform
malware clustering based on static and dynamic features [15, 25,
28, 31]. When the time dimension is also included in the analysis,
clustering can help derive a complete timeline of malware evolution,
also known as the phylogenetic tree of the malware families.
A simpler and common alternative would be to rely on
AV labels, but these are geared towards identifying only macro-families.
We, on the other hand, work towards a finer-grained classification that would enable us to study differences among sub-families
and the overall intra-family evolution and relationships.
In our first attempt we decided to cluster samples based on a
broad set of both static and dynamic features. This approach not
only required a substantial amount of manual adjustment and
validation, but also consistently resulted in noisy clusters. As feature-based
clustering is often used in malware studies, we believe there is
value in reporting the reasons behind its failure. We thus provide a
detailed analysis in Appendix A, with the complete list of extracted
features in Appendix B.

### 3.1 Code-based Clustering


We therefore decided to resort to a more complex and time-consuming solution based on code-level analysis and function similarity. The
advantage is that code does not lie, and therefore can be used to
precisely track both the evolution over time of a given family and
the code reuse and functionalities borrowed among different
families.
The main drawback of clustering based on code similarity is that
the distance between two binaries is difficult to compute. Binary code
similarity is still a very active research area [21], but tools that can
scale to our dataset size are scarce and often in prototype form.




Moreover, to be able to compare binaries, three important conditions must be satisfied: 1) each sample needs to be first properly
unpacked, 2) it must be possible to correctly disassemble its code,
and 3) it must be possible to separate the code of the application
from the code of its libraries. The first two constitute major problems that have hindered similar experiments on Windows malware.
However, IoT malware samples are still largely unobfuscated and
packers are the exception rather than the norm [12]. While this is a
promising start, the third condition turns out to be a difficult issue
(ironically, it is the only one that does not cause problems for traditional
Windows malware).
Figure 2 shows the workflow of our code-based clustering. The
process is divided into three macro phases. (A) First, we process unstripped binaries and analyze the symbols to locate library code
in statically linked files. (B) Then we perform an incremental clustering based on code-level similarity, while propagating symbols
to each new sample. (C) Finally, we build the family graphs (one
for each CPU architecture) and (D) we use the available symbols to pin
samples and clusters to code snippets we were able to scrape from
online code repositories, to obtain a more detailed understanding
of the evolution of malware families.
Recall that our goal is not to provide a future-proof IoT malware
analysis technique. We rather seek to identify a scalable approach
that enables us to reconstruct the lineage for the 93K samples in our
3.5-year-long dataset so we can report on their genealogy. We thus
take advantage of the current state of IoT malware, which
is still rudimentary enough to enable code-based analysis,
aware that malware authors could easily employ tricks to hinder
such analysis in the future.

### 3.2 Symbols Extraction

IoT malware is often shipped statically linked. The fact that 88,392
samples out of 93,652 (94.3%) in our dataset are statically linked
tends to confirm this assumption. This is most likely due to an effort
to ensure the samples can run on devices with various system
configurations. However, computing code similarity on whole statically
linked binaries is misleading, as two samples would be erroneously
considered very similar simply because they might include the same
libc library. Therefore, to be able to identify the relevant functions
in such binaries, we first need to distinguish the user-defined code
from the library code embedded in them. Unfortunately, when
dealing with stripped binaries, this is still an open problem and the
techniques proposed to date have large margins of error, which
are not suitable for our large-scale, unsupervised experiments.
We thus start our analysis by extracting symbols from unstripped
binaries and leveraging them to add semantics to the disassembled
code. Luckily, as shown in Table 1, 53% of statically linked and
30% of dynamically linked samples contain symbol information. We
used IDA Pro to recognize functions and extract their names. We
then use a simple heuristic to cut the binary in two. The idea is to
locate some library code, and then simply consider everything that
comes after it as library code as well. While it is possible for the linker
to interleave application and library objects in the final executable,
this would simply result in discarding part of the malware code
from our analysis. However, this is not a common behavior and,
lacking any better solution to handle this problem, this is a small
price to pay to be able to compute binary similarity on our dataset.
We therefore built a database of symbols (symbols DB in Figure 2)
extracted from different versions of Glibc and uClibc and use the
database to find a “cut” that separates user from library code. After
extracting the function symbols from unstripped ELF samples, we
start scanning them linearly with respect to their offsets. We move
a sliding window starting from the entry point function _start and
define a cutting point as soon as all of the function names within
that window have a positive match in the symbols DB. Using a
window instead of a single function match avoids erroneous cases
where a user function name may be wrongly interpreted as a library
function. We experimentally set this window size to 2 and verified
the reliability of this heuristic by manually analyzing 100 cases.
Once the cutting point is identified, all symbols before this point
are kept and the remaining ones are discarded.
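
The sliding-window cut can be illustrated with a minimal sketch. The function name `find_cut`, the `(offset, name)` input format, and the choice to keep every symbol when no window matches are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from typing import Iterable, List, Set, Tuple


def find_cut(functions: Iterable[Tuple[int, str]],
             libc_symbols: Set[str],
             window: int = 2,
             entry: str = "_start") -> List[str]:
    """Return the function names kept as user code (those before the cut).

    `functions` holds (offset, name) pairs from an unstripped ELF file and
    `libc_symbols` plays the role of the symbols DB. Both are hypothetical
    inputs chosen for this sketch.
    """
    # Scan functions linearly with respect to their offsets.
    names = [name for _, name in sorted(functions)]
    # Start sliding the window from the entry point, if it is present.
    start = names.index(entry) if entry in names else 0
    for i in range(start, len(names) - window + 1):
        # Cut as soon as every name in the window matches the symbols DB.
        if all(name in libc_symbols for name in names[i:i + window]):
            return names[:i]
    # Assumption: without a matching window, nothing is discarded.
    return names
```

With a window of 2, a single function whose name happens to appear in the symbols DB does not trigger the cut, which is the situation the window is meant to guard against.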
We chose to operate only on libc variants for two reasons. First,
compilers always include libc by default in the
final executable when producing statically linked files. Second,
we observed that fewer than 2% of the dynamically linked samples in
our dataset require other libraries on top of libc.
Finally, after removing the library code, we further filter out
other special symbols, including __start, _start and architecture-dependent helpers like the __aeabi_* functions for ARM processors.

### 3.3 Binary Diffing and Symbol Propagation

Binary diffing constitutes the core of our approach as it enables us
to assess the similarity between binaries at the code level. However,
given the intrinsic differences at the (assembly) code level between
binaries of different architectures, we decided to diff together only
binaries compiled for the same architecture, therefore producing
a different clustering graph, and a different malware lineage, for
each architecture. While this choice largely reduces the number of
possible comparisons, our dataset still contains up to 36,574 files
per architecture (ARM 32-bit), making the computation of a full
similarity matrix infeasible.
To mitigate this problem we adopt Hierarchical Navigable Small
World graphs (HNSW) [32], an efficient data structure for approximate nearest neighbor discovery in non-metric spaces, to overcome
the time complexity and discover similarities in our dataset. The
core idea that accelerates this and similar approaches [14, 17] is
that items only get compared to neighbors of previously-discovered
neighbors, drastically limiting the number of comparisons while
still maintaining high accuracy. While adding files to the HNSW,
our distance function will be called on a limited number of file
pairs (on average, adding an element to the HNSW requires only
244 comparisons in our case) while still being able to link it to its
most similar neighbors. We configured the HNSW algorithm to
take advantage of parallelism and provide high-quality results as
suggested by existing guidelines in the clustering literature [13].
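
To make the "neighbors of previously-discovered neighbors" idea concrete, the sketch below implements a flat, single-layer navigable small-world graph in pure Python. It is a deliberate simplification: real HNSW adds a hierarchy of layers and a neighbor-selection heuristic, and the class name `FlatNSW` and the parameters `k` and `ef` are illustrative assumptions rather than the configuration used in the paper.

```python
import heapq
from typing import Callable, Dict, Hashable, List, Set, Tuple


class FlatNSW:
    """Single-layer navigable small-world graph (a simplification of HNSW)."""

    def __init__(self, distance: Callable, k: int = 3, ef: int = 6):
        self.distance = distance   # any dissimilarity, not necessarily a metric
        self.k = k                 # links created for each inserted item
        self.ef = ef               # size of the result list kept during search
        self.neighbors: Dict[Hashable, Set[Hashable]] = {}
        self.comparisons = 0       # number of calls to the distance function

    def _dist(self, a, b) -> float:
        self.comparisons += 1
        return self.distance(a, b)

    def search(self, query) -> List[Tuple[float, Hashable]]:
        """Best-first search; only neighbors of visited items are compared.

        Items must be hashable and orderable (ties on distance are broken
        by comparing the items themselves).
        """
        if not self.neighbors:
            return []
        entry = next(iter(self.neighbors))   # first inserted item
        d0 = self._dist(query, entry)
        visited = {entry}
        candidates = [(d0, entry)]           # min-heap of items to expand
        best = [(-d0, entry)]                # max-heap of the closest items
        while candidates:
            d, node = heapq.heappop(candidates)
            if len(best) >= self.ef and d > -best[0][0]:
                break                        # the result list cannot improve
            for nb in self.neighbors[node]:
                if nb in visited:
                    continue
                visited.add(nb)
                dn = self._dist(query, nb)
                if len(best) < self.ef or dn < -best[0][0]:
                    heapq.heappush(candidates, (dn, nb))
                    heapq.heappush(best, (-dn, nb))
                    if len(best) > self.ef:
                        heapq.heappop(best)  # drop the current worst result
        return sorted((-neg, item) for neg, item in best)

    def add(self, item) -> None:
        """Insert an item and link it to the k closest items found."""
        nearest = self.search(item)[: self.k]
        self.neighbors[item] = set()
        for _, nb in nearest:
            self.neighbors[item].add(nb)
            self.neighbors[nb].add(item)
```

The `comparisons` counter makes it possible to check how many distance evaluations a given dataset and insertion order actually cost, compared with the n(n-1)/2 evaluations of a full similarity matrix.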
**Figure 2: The workflow of our system.** The figure shows four phases: (A) symbol extraction, (B) binary diffing and symbol propagation (HNSW-based binary diffing with Diaphora, backed by a function-level similarity DB), (C) similarity graph, and (D) source code collection (web scraping into a source code DB).

We use Diaphora [1] to define our dissimilarity function for
HNSW. This function is non-metric as the triangle inequality
does not necessarily hold. However, in the following we will call it
_distance function_ without implying it is a proper metric. This has
no consequences on the precision of our clustering, as the HNSW
algorithm is explicitly designed for non-metric spaces. One of the
advantages of using Diaphora is that the tool works with all the
architectures supported by IDA Pro, which covers 11 processors
and 99.9% of the samples in our dataset, while other binary code
similarity solutions recently proposed in academia handle only a few
architectures and do not provide publicly available implementations [21]. When two binaries are compared, Diaphora outputs a
per-function similarity score ranging from 0 (no match) to 1 (perfect
match). To aggregate individual function scores into a single distance
function we experimented with different solutions based on the
average, maximum, normalized average, and sum of the scores. We
finally decided to count the number of functions with similarity
greater than 0.5, which is the threshold suggested by Diaphora’s
authors to discard unreliable results. This has the advantage of
providing a symmetric score (e.g., if the similarity of A to B is 4
then the similarity of B to A is also 4) that steadily increases as
more and more matching functions are found between two binaries.
For HNSW we then report the inverse of this count to translate the
value into a distance (where higher values mean two samples are
further apart and lower values mean they share more code).
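
The aggregation described above can be written down in a few lines. The sketch assumes the per-function scores are available as a dictionary keyed by matched function pairs; this is a hypothetical representation and not Diaphora's actual output format. Mapping a pair of binaries with no matching functions to an infinite distance is also an assumption, since the text does not say how that case is handled.

```python
import math
from typing import Dict, Tuple


def shared_functions(scores: Dict[Tuple[str, str], float],
                     threshold: float = 0.5) -> int:
    """Count matched function pairs whose similarity is greater than the threshold."""
    return sum(1 for score in scores.values() if score > threshold)


def distance(scores: Dict[Tuple[str, str], float],
             threshold: float = 0.5) -> float:
    """Inverse of the number of shared functions (lower means more shared code)."""
    count = shared_functions(scores, threshold)
    # Assumption: binaries with no function above the threshold are infinitely far apart.
    return math.inf if count == 0 else 1.0 / count
```

Because the count depends only on the set of matched pairs, swapping the two binaries yields the same value, which is the symmetry property mentioned in the text.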
Before running HNSW to perform pairwise comparison on the
whole dataset, we unpacked 6,752 packed samples. Since they were
all based on variations of UPX, we were able to easily do that by
using a simple generic unpacker. We then add each sample to HNSW
one by one, in two rounds, sorted by their first seen timestamp on
VirusTotal (to simulate the way an analyst would proceed when
collecting new samples over time).
In the first round we added all dynamically linked or unstripped
samples, which account for 55% of the entire dataset. By relying
on the symbols extracted in the previous phase, we only perform
the binary diffing on the user-defined portion of the code, and
omit comparisons on library code. In the second round we then
added the statically linked stripped samples. Since these samples carry no symbols,
there is no direct way to distinguish user functions from library
code. Attempts to recover debugging information from stripped
binaries, such as Debin [22], only target a limited set of CPU
architectures.
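
The two-round insertion order can be sketched as follows. The `Sample` record and its field names are hypothetical and only serve to illustrate the ordering: samples that are dynamically linked or unstripped go first, statically linked stripped samples go second, and each round is sorted by first-seen timestamp.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Sample:
    sha256: str
    first_seen: int      # e.g. a Unix timestamp taken from the VirusTotal report
    static: bool         # statically linked
    stripped: bool       # symbols removed


def insertion_order(samples: Iterable[Sample]) -> List[Sample]:
    """Round 1: dynamically linked or unstripped; round 2: static and stripped."""
    samples = list(samples)
    first = [s for s in samples if not s.static or not s.stripped]
    second = [s for s in samples if s.static and s.stripped]
    by_time = lambda s: s.first_seen
    return sorted(first, key=by_time) + sorted(second, key=by_time)
```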
We tackle this problem by leveraging the binary diffing itself
to iteratively “propagate” symbols. When a function in a stripped
sample has perfect similarity with a function in an unstripped one, we label it
with the same symbol. This methodology enables us to perform
similarity analysis also for stripped samples, which would otherwise be discarded. However, this step comes with some limitations.
While we are able to discard library functions, we also potentially
discard user functions that did not match any function already in the
graph. For instance, if two stripped statically linked samples share a
function that is never observed in unstripped or dynamically linked
binaries, this similarity would not be detected by our solution. To contain
this problem as much as possible, we add the stripped samples to HNSW only
after all the unstripped ones have been added,
but the probabilistic nature of HNSW can decrease this benefit as
not all comparisons are computed for each sample. This means that
our graph is an under-approximation of the perfect similarity graph:
we can miss some edges that would link together different samples, but we do not create false connections. Overall, the process involved over 18.7M one-to-one
binary comparisons and propagated 595,039 function symbols from
unstripped to stripped binaries.
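
A minimal sketch of the propagation step is shown below. The triple format `(stripped_function, known_name, similarity)` and the function name `propagate_symbols` are assumptions made for illustration. The sketch labels a stripped function only on a perfect match, then drops library symbols, so unmatched user functions are discarded exactly as described in the limitation above.

```python
from typing import Dict, Iterable, Set, Tuple


def propagate_symbols(matches: Iterable[Tuple[str, str, float]],
                      library_symbols: Set[str]) -> Dict[str, str]:
    """Label stripped functions that perfectly match an already-named function.

    Each match is (stripped_function, known_name, similarity), a hypothetical
    format standing in for the stored per-function diffing results.
    """
    labels: Dict[str, str] = {}
    for stripped_function, known_name, similarity in matches:
        # Only perfect matches propagate a symbol; keep the first one found.
        if similarity == 1.0 and stripped_function not in labels:
            labels[stripped_function] = known_name
    # Library functions are recognized by name and removed from the result.
    return {func: name for func, name in labels.items()
            if name not in library_symbols}
```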

### 3.4 Source Code Collection

The symbols extracted from unstripped malware and propagated
in the similarity phase also help us locate and collect snippets
of source code from online sources. In fact, the source code of
many Linux-based IoT malware families has been leaked on open
repositories hosting malicious packages ready to be compiled and
deployed. This has resulted in a very active community of developers that cloned, reused, adapted, and often re-shared variations of
existing code.
We took advantage of this to recognize open source and closed
source families, split our dataset accordingly and, more importantly,
to assign labels to groups of nodes in the similarity graph. While we
also use AV labels for this purpose, those labels often correspond to
generic family names, while online sources can help disambiguate
specific variants within the same bigger family.
To locate examples of source code, we queried search engines
with the list of user-defined function symbols extracted in the
previous phase. We were able to find several matches on public
services such as GitHub or Pastebin, both for entire code bases (e.g.,
on GitHub) and for single source files (e.g., on Pastebin). Interestingly, on GitHub we found tens of repositories forked thousands
of times (not necessarily for malicious purposes, as security
researchers often fork those repositories too). Moreover, we found
a Russian website hosting a repository regularly populated with
several malware projects, exploits, and cross-compilation resources.
From this source alone we were able to retrieve the code of 76 variants of Gafgyt, 50 variants of Mirai, 19 projects generically referred
to as “CnC Botnet” and “IRC Sources” (which resemble Tsunami variants) and a number of exploits for widely deployed router brands.
Some variants contained changelog information that made us believe these projects had been collected from leaks and underground
forums.

### 3.5 Phylogenetic Tree of IoT Malware


As a preamble to the function-level similarity analysis of IoT malware, we post-processed the sparse similarity graph G obtained by
running HNSW, using the distance function as weight. Since
we store the detailed comparisons in a database, the actual weight
on the similarity graph can be tuned depending on the purpose of
the analysis.
For instance, the analyst can use only best matches if the goal
is to highlight perfect similarity (e.g., code reused as is) between
two binaries, or a combination of best and partial matches if the goal
is to capture more generic dependencies between two binaries,
including minor variations and “evolutions” of the code.
Another problem with the similarity graph is that it contains
a large number of edges, with many samples being variations (or
simple recompilation) of the same family. Therefore, to make the
output more readable and better emphasize the evolution lines, in
our graphs we visualize the Minimum Spanning Tree (MST ) G [′] of
_G that shows the path of minimum binary difference among all_
samples. This approach to cluster binaries is inspired by the works
in clustering literature that are based on the minimum spanning
tree (MST) of the pairwise distance matrix between elements [4, 11].
Furthermore, we observed that MSTs—which are in general
used as an intermediate representation of the clustering structure—
faithfully convey information about the relationships between items
in our dataset which is not always preserved when converting the
MST to a set of clusters. For this reason, we base our analysis on
minimum spanning trees.
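The extraction of the tree from the weighted similarity graph can be sketched with Kruskal's algorithm. This is a minimal, generic version written for illustration, not the code of our pipeline; it assumes nodes are numbered from 0 and edges are given as `(distance, u, v)` tuples. Because a sparse nearest-neighbor graph may be disconnected, the function returns a spanning forest in that case.

```python
def minimum_spanning_forest(n_nodes, edges):
    """Kruskal's algorithm over (distance, u, v) edges.

    Returns the selected edges in increasing order of distance.
    """
    parent = list(range(n_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for dist, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two separate components
            parent[root_u] = root_v
            tree.append((dist, u, v))
    return tree


# Toy graph with four samples; the distances are made up.
edges = [(0.05, 0, 1), (0.10, 1, 2), (0.60, 0, 2), (0.40, 2, 3), (0.45, 1, 3)]
tree = minimum_spanning_forest(4, edges)
```

In the toy graph the edges with distance 0.60 and 0.45 are dropped because a cheaper path already connects their endpoints.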
The tree can be further colored according to AV labels (to get
an overview of the relationships among different families and spot
erroneous labels assigned by AV engines) or to the closest source
file we downloaded using the symbol names (thus leading to a
clearer picture of the genealogy of a single malware family). In the
next sections we will explore these two views and present a number
of examples of the main findings.

### 4 RESULTS


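**Table 3: Common functions across top 10 malware families.**

| VS | Mirai | Tsunami | Dnsamp | Hajime | Ddostf | Lightaidra | Pnscan | Skeeyah | VPNFilter |
|---|---|---|---|---|---|---|---|---|---|
| Gafgyt | 115 | 189 | 3 | 1 | 2 | 18 | - | - | - |
| Mirai | | 63 | 1 | 1 | - | 2 | - | - | - |
| Tsunami | | | 4 | - | 3 | 1 | - | - | - |
| Dnsamp | | | | - | 65 | - | - | - | - |
| Hajime | | | | | - | - | - | - | - |
| Ddostf | | | | | | - | - | - | - |
| Lightaidra | | | | | | | - | - | - |
| Pnscan | | | | | | | | - | - |
| Skeeyah | | | | | | | | | - |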

We used the workflow for code-based clustering presented in the
previous section to plot phylogenetic trees for the six top architectures in our dataset. We found that the current IoT malware
scene is dominated by three families tightly connected to each
other: Gafgyt, Mirai and Tsunami. They contain hundreds of variants grouped under the same AV label and are the ones with the longest
persistence on VirusTotal. All three started to present fused traits
over time and they still hit on VirusTotal. On the other hand, more
specialized IoT malware targets specific CPU architectures and
has a much shorter appearance. Today IoT malware code is not as
complex as the one found in Windows malware, yet AVs may lose
robustness when it comes to identifying widely reused functions
and packed samples.
As described in Section 3.5, the distance function we used for
the HNSW algorithm is based on the number of functions with
binary similarity ≥ 0.5 (as suggested by Diaphora). The analyst
can then adjust this threshold when plotting the graphs to either
display even uncertain similarities among families (at 0.5 threshold)
or highlight only the perfect matches of exact code reuse (at 1.0
threshold).

### 4.1 Code Reuse

Figure 3 shows the lineage graph for MIPS samples plotted at similarity ≥ 0.9 and with nodes colored according to their AVClass
labels.
Overall, MIPS samples include 39 different labels. However, the
graph is dominated by a few large families: Gafgyt, Tsunami and
_Mirai_. These three families cover 87% of the MIPS samples and they
are also the ones that served as inspiration for different groups of
malware developers, most likely because their source
code can be found online. It is interesting to note how this tangled
dependency is reflected in the fact that most of the Tsunami
variants are located on the left side of the picture close to Gafgyt,
but some of them also appear on the right side due to an increased
number of routines borrowed from the Mirai code.
Besides these three main players, the graph also shows samples
without any label or belonging to minor families. For example,
the zoom region [A] contains a small connected component of
283 Dnsamp samples with a tail of 4 samples: 1 with label Ganiw
and 3 with label Kluh. Together, they are linked to ChinaZ, a group
known for developing DDoS ELF malware. The very high similarity
between Ganiw and Kluh is more interesting, since Kluh
could be seen as an evolution of Ganiw (it appeared on VirusTotal
3 months later), yet AVs assign them different labels.
Table 3 reports the number of shared functions (at 0.9 similarity)
across the top 10 families in our dataset and takes into account


-----

The Tangled Genealogy of IoT Malware ACSAC 2020, December 7–11, 2020, Austin, USA

**Figure 3: Lineage graph of MIPS samples colored by family.** (Legend: Gafgyt, Mirai, Tsunami, Dnsamp; zoom regions A and B.)


**Table 4: Outlier samples and AVClass labels (number of samples).**

| Architecture | Wrong label | Without label | Total |
|---|---|---|---|
| ARM 32-bit | 19 | 9 | 28 |
| MIPS I | 25 | 41 | 66 |
| PowerPC | 1 | 4 | 5 |
| SPARC | 2 | 0 | 2 |
| Hitachi SH | 7 | 0 | 7 |
| Motorola 68000 | 8 | 2 | 10 |
| **Total** | 62 | 56 | 118 |

the full picture of the six main architectures. The code sharing
for Mirai, Gafgyt and Tsunami is once again confirmed to play
a fundamental role in IoT malware with hundreds of functions
shared across the three. However, we can also see their influence on
minor families like Dnsamp, which borrows functions for random
number generation and checksum computations, or Lightaidra,
which reuses 18 functions from Gafgyt. Less widespread families such
as Dnsamp and Ddostf also show high similarity with a total of 65
shared functions. Instead, targeted campaigns like VPNFilter do not
overlap with main components of the famous families.
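The cross-family counts reported in Table 3 can be derived from the stored function-level comparisons. The sketch below is one possible way to do it, assuming each comparison is available as a `(sample_a, sample_b, symbol, ratio)` tuple in which `symbol` is a canonical name for the matched function (for instance an original or propagated symbol). That record layout, the helper name and the sample data are hypothetical.

```python
from collections import defaultdict


def shared_function_counts(matches, family_of, min_ratio=0.9):
    """Count distinct functions shared between each pair of families.

    matches: iterable of (sample_a, sample_b, symbol, ratio) tuples.
    family_of: mapping from sample identifier to family label.
    """
    shared = defaultdict(set)
    for sample_a, sample_b, symbol, ratio in matches:
        fam_a, fam_b = family_of[sample_a], family_of[sample_b]
        if ratio < min_ratio or fam_a == fam_b:
            continue  # below threshold, or reuse within a single family
        # Sort the pair so (A, B) and (B, A) land in the same bucket.
        shared[tuple(sorted((fam_a, fam_b)))].add(symbol)
    return {pair: len(symbols) for pair, symbols in shared.items()}


# Placeholder identifiers and ratios, for illustration only.
family_of = {"s1": "Gafgyt", "s2": "Lightaidra", "s3": "Mirai"}
matches = [
    ("s1", "s2", "func_a", 0.95),
    ("s2", "s1", "func_a", 0.93),  # same function seen from the other side
    ("s1", "s2", "func_b", 1.00),
    ("s1", "s3", "func_c", 0.70),  # discarded at the 0.9 threshold
]
counts = shared_function_counts(matches, family_of)
```

Using a set per family pair ensures that a function reused by many samples is counted once rather than once per sample pair.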


### 4.2 Outliers and AV Errors

One of the analyses we can perform on the phylogenetic trees is
the detection of anomalous labels, by looking for outlier nodes. We
define as an outlier a node (or set of nodes) of one color that is part of a
cluster otherwise containing only nodes of a different color. Outliers can
correspond to samples that are misclassified by the majority of AV
scanners or to variants of a given family that have a considerable
amount of code in common with another family (and for which,
therefore, it is difficult to decide which label is more appropriate).
The same neighborhood analysis can also be used to assign a label, based on neighboring nodes,
to samples for which AVClass did not return one.
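A simplified version of this check can be sketched as follows. It flags a single node when all of its labeled tree neighbors agree on one label that differs from the node's own (or when the node has no label), and suggests the neighbors' label. This is a deliberately reduced, single-node variant of the definition above, written under our own assumptions (the `min_support` parameter and the adjacency-dictionary input are hypothetical); it does not replace the manual inspection of each group of outliers.

```python
from collections import Counter


def find_outliers(adjacency, labels, min_support=2):
    """Return {node: suggested_label} for nodes that disagree with their neighbors.

    adjacency: mapping node -> list of neighboring nodes in the tree.
    labels: mapping node -> family label, or None when no label is available.
    """
    outliers = {}
    for node, neighbors in adjacency.items():
        votes = Counter(labels[n] for n in neighbors if labels.get(n) is not None)
        if len(votes) != 1:
            continue  # no labeled neighbor, or a mixed neighborhood
        dominant, support = next(iter(votes.items()))
        if support >= min_support and labels.get(node) != dominant:
            outliers[node] = dominant
    return outliers


# Toy path a-b-c-d-e-f-g with one foreign label and one missing label.
adjacency = {
    "a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"],
    "e": ["d", "f"], "f": ["e", "g"], "g": ["f"],
}
labels = {"a": "Gafgyt", "b": "Gafgyt", "c": "Tsunami", "d": "Gafgyt",
          "e": "Gafgyt", "f": None, "g": "Gafgyt"}
outliers = find_outliers(adjacency, labels)
```

Requiring at least two agreeing neighbors keeps leaf nodes that happen to hang off an outlier from being flagged themselves.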
Although the number of mislabelled samples is not significant in
our dataset, we can use our automated pipeline to promptly detect
suspicious cases in newly collected data. The outliers discussed in
this section also show that a very high ratio of code similarity can
often confuse several AV signatures.
Based on a manual inspection of each group of outliers, Table 4
reports a lower-bound estimate of the mislabelling cases broken
down by architecture. Overall we found 118 cases, with 62 samples
we believe to have a wrong AVClass label and 56 for which AVClass
was not able to agree on the AV labels. ARM and MIPS (which cover
66% of our dataset) are responsible for almost 80% of the errors (94 of 118), with
MIPS samples being apparently the most problematic to classify.
The pattern is reversed for less popular architectures, like Hitachi
SH and Motorola 68000 (13.3% of the dataset) that account for


-----

ACSAC 2020, December 7–11, 2020, Austin, USA Cozzi, et al.


**Table 5: Number of variants recognized for top 10 families in our dataset. Malware families with - contained only stripped samples, which prevented any accurate variant identification.**

| Family | Candidate variants | Validated variants (source code) | Number of samples (min.) | Number of samples (max.) | Number of samples (avg.) | Persistence in days (max.) | Persistence in days (avg.) |
|---|---|---|---|---|---|---|---|
| Gafgyt | 1428 | 140 | 1 | 4499 | 285.59 | 1210 | 283.21 |
| Mirai | 386 | 57 | 1 | 776 | 39.05 | 661 | 103.35 |
| Tsunami | 210 | 27 | 1 | 544 | 93.59 | 1261 | 421.63 |
| Dnsamp | 48 | 4 | 3 | 1394 | 362.75 | 1444 | 691.25 |
| Hajime | 1 | 1 | 1 | 1 | 1 | 1 | 1.00 |
| Ddostf | 11 | 3 | 2 | 755 | 260.00 | 483 | 308.33 |
| Lightaidra | 7 | 7 | 1 | 4 | 1.43 | 299 | 43.57 |
| Pnscan | 1 | 1 | 2 | 2 | 2.00 | 1 | 1.00 |
| Skeeyah | - | - | - | - | - | - | - |
| VPNFilter | - | - | - | - | - | - | - |
| **Total** | 2091 | 240 | | | | | |


**Figure 4: Appearance of new variants over time.** (Cumulative plot; x-axis: Date, Jan 2015 to Jan 2018; series: gafgyt, mirai, tsunami, dnsamp.)

17 mislabelled samples, while PowerPC and SPARC (20.6% of the
dataset) had only 7 cases.
Looking closer, all cases of wrong labels seemed to be due to
a high portion of code reuse between two or more families. The
zoom region [B] in Figure 3 is an example of this type of error. A
Tsunami variant that borrows a number of utility functions from
Mirai resulted in a few of its samples being misclassified as Gafgyt
by many AV vendors.
Another example, this time related to a smaller family, is a set of
12 Remaiten samples that AVClass reported as Gafgyt (Remaiten is
a botnet discovered by ESET that reuses both Tsunami and Gafgyt
code and extends it with a set of new features). We also observed
that in some cases AVs assign different labels to samples with an
almost full code overlap. For example, under PowerPC, a binary is
assigned the label Pilkah, thus giving birth to a new family, even if
it is only a very minor variation of Lightaidra.
Finally, we found examples of how an extremely simple and well
known packer like UPX can still cause trouble for AV software. For
instance, 29 packed samples for MIPS did not get an AVClass label
even if their code was very close to Gafgyt.


### 4.3 Variants

The phylogenetic trees produced by our method can also be used
to identify fine-grained modifications and relationships among
variants within the same malware family.
In order to bootstrap the identification of variants we decided to
take advantage of the binary similarity-based symbol propagation
described in Section 3.3. As a first step we identify candidate variants by grouping all malware samples based on their set of unique
symbols. These symbols were either present in the binary (in case
of an original unstripped binary) or were propagated from other unstripped binaries (in case of an original stripped binary). While this
symbol-based variant identification technique is subject to errors –
noise from symbol extraction, incomplete symbol propagation – it
gives a first estimate of the number of variants by capturing fine-grained differences such as added, removed or renamed functions.
Table 5 provides the number of identified variants for the top 10
largest malware families in our dataset. We can see that Gafgyt, and
to a lesser extent Mirai and Tsunami, appear to have spurred more
than 2,000 variants all together. This phenomenon is presumably
fueled by the availability of the source code online for these three
major malware families. It is important to note that given that this
step relies on symbols it excludes all stripped samples for which
symbols could not be propagated, e.g., all samples of the VPNFilter
malware were stripped hindering the identification of variants.
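The first step, grouping samples by their set of unique symbols, can be sketched in a few lines. The input mapping and the symbol names below are placeholders; the sketch assumes that symbols have already been extracted or propagated, and that samples with an empty symbol set (stripped, with nothing propagated) are left out.

```python
from collections import defaultdict


def candidate_variants(symbols_by_sample):
    """Group samples sharing exactly the same set of symbols.

    symbols_by_sample: mapping sample id -> iterable of symbol names.
    Returns a mapping frozenset(symbols) -> list of sample ids.
    """
    groups = defaultdict(list)
    for sample, symbols in symbols_by_sample.items():
        key = frozenset(symbols)
        if not key:
            continue  # stripped sample with no propagated symbols
        groups[key].append(sample)
    return dict(groups)


# Placeholder samples and symbols, for illustration only.
symbols_by_sample = {
    "a": {"main", "sendUDP"},
    "b": {"sendUDP", "main"},
    "c": {"main", "sendUDP", "sendHTTP"},  # one added function: new candidate
    "d": set(),
}
variants = candidate_variants(symbols_by_sample)
```

Because the key is the exact symbol set, adding, removing or renaming a single function is enough to open a new candidate variant, which is also why the estimate is sensitive to noise in symbol extraction.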
As a second step we rely on the leaked source code collected from
online repositories, as described in Section 3.4, to validate previously
identified variants. By matching symbols found or propagated in
the binaries with functions found in the source code we were able
to validate more than 200 variants. It is interesting to see that
as many as 50.3% of the samples had at least a partial match to
our collected source code – but only 740 samples resulted in a
perfect match of the entire code. This suggests that many malware
authors take inspiration from leaked source code, yet they introduce
new modifications, thus creating new independent variants. The
surprisingly high number of variants having their source code
online is a great opportunity for us to validate and better study them.
Validating the others unfortunately requires time-consuming manual
analysis. From Table 5 we can see that the collected source code
enabled us to validate 240 variants with Gafgyt taking a slice equal
to 58% of the total, followed by Mirai with a lower share of almost
24%. The source code we collected matched also minor families.
For example Hajime, known to come with stripped symbols, was
found to have one sample referring to the Gafgyt and Mirai variant
_Okane_, actually suggesting the Hajime sample was misclassified
by AVs. In a similar way, two samples of Pnscan partially matched
with a port scanner tool named like the family and available on
GitHub. However, the authors of these samples introduced new
functionalities to the original code. While the availability of IoT
malware source code online facilitates the development of variants,
it can also be leveraged to identify and validate them. Finally, in
order to evaluate the accuracy of the source code matching we took
an extra step and manually verified and confirmed some of the
variants that matched source code.
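The distinction between partial and perfect matches can be illustrated with a small sketch. It assumes that the symbols of a candidate variant have already been reduced to its own function names (for example, with library symbols filtered out) and that each source project has been parsed into a set of function names; both steps, as well as the project names used here, are hypothetical.

```python
def match_to_source(variant_symbols, source_projects):
    """Find the source project covering the largest share of a variant's symbols.

    Returns (project_name, coverage, kind), where kind is "perfect",
    "partial" or "none". Ties are resolved in favor of the first project.
    """
    if not variant_symbols:
        return None, 0.0, "none"
    best_name, best_coverage = None, 0.0
    for name, source_functions in source_projects.items():
        coverage = len(variant_symbols & source_functions) / len(variant_symbols)
        if coverage > best_coverage:
            best_name, best_coverage = name, coverage
    if best_name is None:
        return None, 0.0, "none"
    kind = "perfect" if best_coverage == 1.0 else "partial"
    return best_name, best_coverage, kind


# Placeholder projects and function names.
source_projects = {
    "project_a": {"f1", "f2", "f3"},
    "project_b": {"f1", "f2", "f3", "f4", "f5"},
}
perfect = match_to_source({"f1", "f2", "f3", "f4"}, source_projects)
partial = match_to_source({"f1", "f9"}, source_projects)
missing = match_to_source({"f8"}, source_projects)
```

A partial match typically indicates a variant that started from known code and then added or renamed functions, which is the situation described above for the majority of matched samples.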
Another important aspect to understand the genealogy of IoT
malware is the combination of binary data with timing information.
By measuring the first and last time associated to each variant we
can get a temporal window in which the samples of each variant
appeared in the wild (shown in the last two columns of Table 5).
Here we can notice how quickly-evolving families like _Gafgyt_ and
_Mirai_ tend to result in short-lived variants. For instance, Gafgyt
variants appeared in VirusTotal for an average of 10 months, and
Mirai variants for four. Instead, Tsunami and Dnsamp variants
persisted for longer periods: one year and two months for the
former and almost two years for the latter. Figure 4 shows, in a
cumulative graph, the number of new variants that appeared over
time for the three main families. It is interesting to observe the
almost constant new release of Gafgyt variations over time, the
slower increase of Tsunami variants, and the rapid proliferation of
_Mirai-based malware in 2018._
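The persistence columns of Table 5 can be reproduced from submission dates along the following lines. The sketch counts days inclusively, so that a variant observed on a single day has a persistence of 1; this convention is an assumption on our side, chosen because it is consistent with the single-sample rows of the table. The variant names and dates are made up.

```python
from datetime import date
from statistics import mean


def persistence_days(submission_dates):
    """Number of days between first and last submission, counted inclusively."""
    return (max(submission_dates) - min(submission_dates)).days + 1


# Hypothetical submission dates for two variants of one family.
variants = {
    "variant_1": [date(2017, 1, 1), date(2017, 1, 4), date(2017, 1, 10)],
    "variant_2": [date(2017, 3, 5)],
}
per_variant = {name: persistence_days(dates) for name, dates in variants.items()}
longest = max(per_variant.values())
average = mean(per_variant.values())
```

Aggregating the per-variant values by family yields the maximum and average persistence reported for that family.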

### 5 CASE STUDIES

After showing our automated approach for systematic identification of code reuse in Section 3 and presenting an overview of the
phylogenetic tree in Section 4, we now discuss in more detail two
case studies. We use these examples as an opportunity to provide a
closer look at two individual families and discuss their evolution
and the multitude of internal variants.
It is important to note that the exact time at which each sample
was developed is particularly difficult to identify as malware could
remain undetected for long periods of time. Since ELF files do not
contain a timestamp of when they were compiled, we can only rely
on public discussions and on the VirusTotal first submission time
as source for our labeling. Some families are only discussed in blog
posts by authors that did not submit their samples to VirusTotal.
Previous research also found that for APT campaigns the initial
VirusTotal collection time often pre-dates the time in which the
samples are “discovered” and analyzed by human experts by months
or even years [19]. Therefore, in our analysis we simply report the
earliest date among the ones we found in online sources and among
all samples submitted for the same variant to VirusTotal. However,
this effort is only performed for presentation purposes, as we believe
that detecting the similarities and changes among samples (the goal
of our analysis) is more important than determining which ones
came first.

### Example I – Tsunami (medium-sized family)

_Tsunami_ is a popular IRC botnet with DDoS capabilities whose
samples represent almost 4% of our dataset. Its code is available
online and gives birth to a continuous proliferation of new variants,
sometimes with minimal differences, other times with major improvements (i.e., new exploits and new functionalities). Tsunami’s
main goal is to compromise as many devices as possible to build
large DDoS botnets. Therefore, we obtained samples compiled also
for less common architectures such as Motorola 68K or SuperH.
Overall, 76% of its samples are statically linked but with the original symbols in place. When constructing the genealogy graph of
_Tsunami_, we not only took advantage of the extracted symbols
from the binaries but we also cross-correlated them with available
source code of multiple variants we scraped from online forums, as
explained in Section 3.4. This way we were able to color the graph
and assign a name to different variants.


The top part of Figure 5 shows the mini-graph for six different
architectures. The main part of the figure further zooms in on the
evolution of a group of 748 ARM 32-bit binaries. These samples all
share the main functionality of Tsunami and therefore the functions
for DDoSing and contacting the CnC remained the same across all
of them.
On the far right of Figure 5, there is a visible section in which
the vast majority of samples are labeled as Kstd according to the AV
labels. With only two flooders, Kstd represents one of the oldest and
most famous sub-families, which acted as a skeleton and inspiration
for newer malware strains. Moving left on the graph, we meet a
highly dynamic area with binaries very similar to each other but
with new features such as frequent updates and new flooders. The
first samples in this group correspond to the Capsaicin sub-family,
for which we performed a manual investigation to identify the
new functionalities. Capsaicin includes 16 flooders based on TCP,
UDP and amplification attacks. It uses gcc directly on the infected
device, taking its presence for granted. Some Tsunami variants are
also examples of inter-family code reuse, with code borrowed from
both Mirai and Gafgyt. For example, Capsaicin borrows from Mirai
the code for the random generation of IP addresses that is used
to locate candidate victims to infect. Some Tsunami samples also
perform horizontal movement reusing Mirai’s Telnet scanner or
SSH scanners also found in Gafgyt, while others use open source
code as inspiration (e.g., the Uzi scanner).
Moving left we then encounter the Weebsquad and Uzi variants.
The first is a branch spreading over Telnet and SSH, for which we
could not find any online source code that matched our samples.
We named these variants based on the fact that they all included
their name in the binaries. Interestingly some AVs on VirusTotal
mislabeled these samples as Gafgyt, possibly because of the code reuse between Tsunami and Gafgyt we mentioned earlier.
On the left side of Figure 5 we encounter Kaiten, another popular
variant from which many malware writers forked their code to
create their own projects. For instance, Zbot (bottom-left on the
graph) is a Kaiten fork available on GitHub, in which the authors
added two additional flooder components.
Our similarity analysis also recognizes Amnesia, a variant which
was discovered by M. Malík of ESET in January 2017. This sub-family includes exploits for CCTV systems and it is one of the
rare Linux malware families adopting Virtual Machine (VM) detection techniques. Unlike most of the samples in the graph, Amnesia is stripped
and dynamically linked. However, our system detected very high
code similarity with another unstripped sample which uses the
same CCTV scanner and persistence techniques, but without VM
detection capabilities. Thanks to our symbol propagation technique
we also managed to connect the Amnesia samples to the rest of the
family graph.

### Example II – Gafgyt (large family)

_Gafgyt_ is the most active IoT botnet to date. It comprises
hundreds of variants and is the biggest family in our dataset with
50% of the samples. It targets home routers and other classes of
vulnerable devices, including gaming services [20].
We visualize the code-similarity analysis of samples for ARM
32-bit in Figure 6. Our system identified more than 100 individual



**Figure 5: Lineage graph of Tsunami samples for ARM 32-bit.**

(Figure placeholder. Mini-graphs: MIPS I, SuperH, Motorola 68k, ARM 32-bit, SPARC, PowerPC. Legend: colors represent a variant; stripped samples; binary similarity. Annotated variants:)

- _amnesia_: CCTV exploit, VM detection, VM wipe, root/user persistence
- _"weebsquad"_: initial version similar to kaiten, high code reuse after, SSH scanner, Telnet scanner
- _capsaicin_: frequent updates, new flooders, upgrade itself using local gcc, Mirai IP generator
- _zbot_: initially kaiten-like, then reuse flooders, new CnC commands
- _kstd_: very basic, only a couple of flooders
- _uzi_: "uzi" Telnet scanner
- _kaiten_: supposedly used to name the family Tsunami


variants. Like the Tsunami case study, we were often able to leverage
available source code snippets to validate the identified variants.
The graph is clearly split into two main areas, with binaries
compiled with _THUMB mode_ support on the left, and with _ARM
mode_ only on the right. Since the two halves mirror each other, we label
each variant on only one of the two sides of the graph to improve
readability. _Bashlite_ is believed to be one of the first variants of
_Gafgyt_. Its samples are often mistaken for the _Qbot_ variant (the
two names are frequently presented as synonyms) but their code presents
significant differences. For example, Qbot uses two additional attack
techniques (e.g., DDoS using the GNU utility wget). Our method
correctly recognizes them as belonging to the same family but as
distinct variants.
The next cluster in our genealogy refers to Razor, which fully
reuses the previous source code but adds an additional CnC command to clear log files, delete the shell history, and disable iptables.
_Prometheus_, for which we crawled two bundles, is an example of
malware versioning. Among the features of Prometheus, we see
self-upgrade capabilities and usage of Python scripts (served by the
CnC) for scanning. Its maintainer added a Netis³ scanner in V4 to
reinforce self-propagation through exploitation. Self-propagation
and infection are further enhanced in Galaxy with a scanner dubbed
_BCM_ and one called _Phone_, suggesting it targets real phones. Next
to Galaxy we find an almost one-to-one fork we call Remastered,
which performs a less intrusive cleanup procedure, cleaning temporary directories and history but without stopping iptables and
firewalld.
Finally, in the top left-hand corner of Figure 6 we uncover Angels,
reusing some of Mirai’s code for random IP generation (like other
variants) and targeting specific subnets hard-coded in the binaries.

³ Netis (a.k.a. Netcore in China) is a brand of routers found to contain an RCE vulnerability in 2014 [46].

### 6 RELATED WORK

**IoT Malware Landscape.** Researchers have so far mostly focused
on analyzing the current state of attacks and malware targeting IoT
devices, usually by deploying honeypots [35, 45]. Antonakakis et
al. [3] provided a comprehensive forensic study of the Mirai botnet, reconstructing its history and tracking its various evolutions.
Cozzi et al. [12] carried out a systematic study of Linux malware
describing some of the trends in their behavior and level of sophistication. That work included samples developed for different
Linux-based OSes, without a particular focus on IoT systems. In contrast, our objective is to study the relationships
among IoT malware families (e.g., code reuse) and track sub-family
variants and coding practices observed on a dataset an order of
magnitude larger (93K samples vs. 10K) collected over a period of
3.5 years.
**Malware lineage.** First introduced in 1998 by Goldberg et al. [18],
the concept of malware phylogenetic trees is inspired by the study
of the evolution of biological species. Karim et al. [28] presented
a code fragment permutation-based technique to reconstruct malware family evolution trees. In 2011, Dumitras et al. [15] presented


**Figure 6: Lineage graph of Gafgyt samples for ARM 32-bit.**

(Figure placeholder. Labeled variants: angels, "dankmeme", lovesec, "remastered", qbot, galaxy v4, razor, prometheus, prometheus v4, bashlite. Legend: colors represent a variant; stripped samples; binary similarity.)


some guidance on malware lineage studies: (i) use of a combination of static and dynamic features, e.g., code fragments, dynamic
CFGs, and (ii) use of time and source information related to studied
samples. Lindorfer et al. [31] developed Beagle, a system designed
to track the evolution of a malware family. They rely on dynamic
analysis to extract the different functionalities – in terms of API
calls – exhibited by a piece of malware. They then try to map these
functionalities back to disassembled code so they can identify and
characterise mutations in the malware family. Calleja et al. recently
analyzed in [10] – extending from their previous work [9] – the
evolution of 456 Windows malware samples observed over a period
of 40+ years and identified code reuse between different families as
well as with benign software. The types of code reuse they observed
include essentially anti-analysis routines, shellcode, data such as
credentials for brute-forcing attacks, and utility functions.
Recently, Haq et al. [21] reviewed 61 approaches from the literature on binary code similarity – some of which are used for
malware lineage inference [24, 26, 33] – published over the last 20
years. While they purposely focus on academic contributions rather
than binary diffing tools, they highlight the diversity, strengths and
weaknesses of all these techniques. They also identify several open
problems, some of which were faced in this work, such as scalability
and lack of support of multiple CPU architectures. BinSequence [24]
computes the similarity between functions extracted from binaries
at the instruction, basic block and CFG levels. The authors applied their
technique on different scenarios, including the identification of code
reuse in two Windows malware families. They also claim a function matching accuracy higher than 90% and above state-of-the-art
approaches such as Bindiff or Diaphora. iLine [26] is a graph-based
lineage recovery tool based on a combination of low-level binary
features, code-level basic blocks and binary execution traces. It is
evaluated on a small dataset of 84 Windows malware samples and claims
an accuracy of 72%. Ming et al. [33] also proposed an optimisation
for the iBinHunt binary diffing tool, which computes similarity
between binaries from their execution traces. They further apply
their tool on a dataset of 145 Windows malware samples from 12
different families.
While these approaches for binary similarity and lineage inference provide invaluable insights when applied in the context in
which they were developed, none of them can be applied on Linux-based IoT malware. First of all, few of them are able to handle Linux
binaries, and those that can typically do not go beyond the ARM
and MIPS architectures. We also believe that binary-level or basic
block-based malware slicing is likely to be prone to over-specific
code reuse identification. Similarly, execution traces are likely to
be too coarse-grained for variant identification. Additionally, we
have witnessed in our dataset that, when used, packing of IoT malware can easily be evaded. As a result, given the reasonably low
obfuscation of the IoT malware in our dataset we have decided to
take this opportunity to use function-level binary diffing to identify
relevant code similarities between and within IoT malware families.
Finally, the lack of any available scalable Linux-compatible multi-architecture binary similarity technique led us to choose the open
source binary diffing tool Diaphora [1], an IDA plugin.


### 7 CONCLUSION

We have presented the largest study of IoT malware known to date, over a dataset
consisting of 93K malicious samples. We use binary similarity-based techniques to uncover more than 1500 malware variants and
validate more than 200 of them thanks to their source code leaked
online. AV signatures appear to be not robust enough against small
modifications inside binaries. As such, rewriting a specific function
or borrowing it from another family can be enough to derail AVs,
often leading to mislabeling or missed detections.

### ACKNOWLEDGMENTS

We are grateful to Karl Hiramoto from VirusTotal for assisting us
with the binary samples and VirusTotal reports used for this study.
This research was supported by the European Research Council
(ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 771844 - BitCrumbs).

### REFERENCES

[1] [n.d.]. Diaphora, a free and open source program diffing tool. http://diaphora.re/.

[2] [n.d.]. VirusTotal. https://www.virustotal.com/.

[3] Manos Antonakakis, Tim April, Michael Bailey, Matt Bernhard, Elie Bursztein,
Jaime Cochran, Zakir Durumeric, J Alex Halderman, Luca Invernizzi, Michalis
Kallitsis, et al. 2017. Understanding the mirai botnet. In USENIX Security.

[4] T. Asano, B. Bhattacharya, M. Keil, and F. Yao. 1988. Clustering Algorithms
Based on Minimum and Maximum Spanning Trees. In Proceedings of the Fourth
_Annual Symposium on Computational Geometry (Urbana-Champaign, Illinois,_
USA) (SCG ’88). Association for Computing Machinery, New York, NY, USA,
252–257. https://doi.org/10.1145/73393.73419

[5] Michael Bailey, Jon Oberheide, Jon Andersen, Z Morley Mao, Farnam Jahanian,
and Jose Nazario. 2007. Automated classification and analysis of internet malware.
In RAID.

[6] Ulrich Bayer, Paolo Milani Comparetti, Clemens Hlauschek, Christopher Kruegel,
and Engin Kirda. 2009. Scalable, behavior-based malware clustering.. In NDSS.

[7] BitDefender. 2018. New Hide ‘N Seek IoT Botnet using custom-built Peer-to-Peer
communication spotted in the wild. https://labs.bitdefender.com/2018/01/new-hide-n-seek-iot-botnet-using-custom-built-peer-to-peer-communication-spotted-in-the-wild/.

[8] BleepingComputer. 2019. Cr1ptT0r Ransomware Infects D-Link NAS Devices,
Targets Embedded Systems. https://www.bleepingcomputer.com/news/security/cr1ptt0r-ransomware-infects-d-link-nas-devices-targets-embedded-systems/.

[9] Alejandro Calleja, Juan Tapiador, and Juan Caballero. 2016. A Look into 30
Years of Malware Development from a Software Metrics Perspective, Vol. 9854.
325–345.

[10] Alejandro Calleja, Juan Tapiador, and Juan Caballero. 2018. The MalSource
Dataset: Quantifying Complexity and Code Reuse in Malware Development. (11
2018).

[11] Ricardo JGB Campello, Davoud Moulavi, and Jörg Sander. 2013. Density-based
clustering based on hierarchical density estimates. In PAKDD.

[12] Emanuele Cozzi, Mariano Graziano, Yanick Fratantonio, and Davide Balzarotti.
2018. Understanding Linux Malware. In IEEE S&P.

[13] Matteo Dell’Amico. 2019. FISHDBC: Flexible, Incremental, Scalable,
Hierarchical Density-Based Clustering for Arbitrary Data and Distance.
[arXiv:1910.07283 [cs.LG]](https://arxiv.org/abs/1910.07283)

[14] Wei Dong, Charikar Moses, and Kai Li. 2011. Efficient k-nearest neighbor graph
construction for generic similarity measures. In _Proceedings of the 20th international conference on World wide web_. ACM, 577–586.

[15] Tudor Dumitraş and Iulian Neamtiu. 2011. Experimental Challenges in Cyber
Security: A Story of Provenance and Lineage for Malware. In CEST.

[16] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In
_KDD_.

[17] Cong Fu, Chao Xiang, Changxu Wang, and Deng Cai. 2019. Fast approximate
nearest neighbor search with the navigating spreading-out graph. _Proceedings of
the VLDB Endowment_ 12, 5 (2019), 461–474.

[18] Leslie Ann Goldberg, Paul W Goldberg, Cynthia A Phillips, and Gregory B Sorkin.
1998. Constructing Computer Virus Phylogenies. J. Algorithms 26, 1 (1998).

[19] Mariano Graziano, Davide Canali, Leyla Bilge, Andrea Lanzi, and Davide
Balzarotti. 2015. Needles in a Haystack: Mining Information from Public Dynamic
Analysis Sandboxes for Malware Intelligence. In _Proceedings of the 24th USENIX
Security Symposium (USENIX Security)_.




[20] M. Hao. [n.d.]. A Look into the Gafgyt Botnet Trends from the Communication Traffic Log. [https://nsfocusglobal.com/look-gafgyt-botnet-trends-](https://nsfocusglobal.com/look-gafgyt-botnet-trends-communication-traffic-log/)
[communication-traffic-log/.](https://nsfocusglobal.com/look-gafgyt-botnet-trends-communication-traffic-log/)

[21] Irfan Ul Haq and Juan Caballero. 2019. A Survey of Binary Code Similarity.
[arXiv:1909.11424 [cs.CR]](https://arxiv.org/abs/1909.11424)

[22] Jingxuan He, Pesho Ivanov, Petar Tsankov, Veselin Raychev, and Martin Vechev.
2018. Debin: Predicting debug information in stripped binaries. In _Proceedings
of the 2018 ACM SIGSAC Conference on Computer and Communications Security_.
ACM, 1667–1680.

[23] Xin Hu, Kang G Shin, Sandeep Bhatkar, and Kent Griffin. 2013. Mutantx-s:
Scalable malware clustering based on static features. In USENIX ATC.

[24] He Huang, Amr M. Youssef, and Mourad Debbabi. 2017. BinSequence: Fast,
Accurate and Scalable Binary Code Reuse Detection. In _Proceedings of the 2017
ACM on Asia Conference on Computer and Communications Security (ASIA CCS
’17)_. ACM, 155–166.

[25] Jiyong Jang, David Brumley, and Shobha Venkataraman. 2011. BitShred: Feature
Hashing Malware for Scalable Triage and Semantic Analysis. In ACM CCS.

[26] Jiyong Jang, Maverick Woo, and David Brumley. 2013. Towards Automatic
Software Lineage Inference. In _22nd USENIX Security Symposium (USENIX Security
13)_. USENIX, Washington, D.C., 81–96. [https://www.usenix.org/conference/usenixsecurity13/technical-sessions/papers/jang](https://www.usenix.org/conference/usenixsecurity13/technical-sessions/papers/jang)

[27] Rommel Joven, Jasper Manuel, and David Maciejack. 2018. Mirai: Beyond the Aftermath. [https://www.botconf.eu/wp-content/uploads/2018/12/2018-R-Joven-Mirai-Beyond-the-Aftermath.pdf](https://www.botconf.eu/wp-content/uploads/2018/12/2018-R-Joven-Mirai-Beyond-the-Aftermath.pdf).

[28] Md Enamul Karim, Andrew Walenstein, Arun Lakhotia, and Laxmi Parida. 2005.
Malware phylogeny generation using permutations of code. _Journal in Computer
Virology_ 1 (11 2005).

[29] Dhilung Kirat and Giovanni Vigna. 2015. Malgene: Automatic extraction of
malware analysis evasion signature. In ACM CCS.

[30] Peng Li, Limin Liu, Debin Gao, and Michael K Reiter. 2010. On challenges in
evaluating malware clustering. In RAID.

[31] Martina Lindorfer, Alessandro Di Federico, Federico Maggi, Paolo Milani Comparetti, and Stefano Zanero. 2012. Lines of Malicious Code: Insights into the
Malicious Software Industry. In _Proceedings of the 28th Annual Computer Security
Applications Conference (ACSAC ’12)_. ACM, 349–358.

[32] Y. A. Malkov and D. A. Yashunin. 2018. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. _IEEE
Transactions on Pattern Analysis and Machine Intelligence_ (2018), 1–1. [https://doi.org/10.1109/TPAMI.2018.2889473](https://doi.org/10.1109/TPAMI.2018.2889473)

[33] Jiang Ming, Dongpeng Xu, and Dinghao Wu. 2015. Memoized Semantics-Based
Binary Diffing with Application to Malware Lineage Inference. In _IFIP Advances
in Information and Communication Technology_, Vol. 455. 416–430. [https://doi.org/10.1007/978-3-319-18467-8_28](https://doi.org/10.1007/978-3-319-18467-8_28)

[34] A. Moser, C. Kruegel, and E. Kirda. 2007. Limits of Static Analysis for Malware
Detection. In _Twenty-Third Annual Computer Security Applications Conference
(ACSAC 2007)_. 421–430.

[35] Yin Minn Pa Pa, Shogo Suzuki, Katsunari Yoshioka, Tsutomu Matsumoto,
Takahiro Kasama, and Christian Rossow. 2015. IoTPOT: analysing the rise of IoT
compromises. In WOOT.

[36] PaloAlto Networks. 2019. Home & Small Office Wireless Routers Exploited
[to Attack Gaming Servers. https://unit42.paloaltonetworks.com/home-small-](https://unit42.paloaltonetworks.com/home-small-office-wireless-routers-exploited-to-attack-gaming-servers/)
[office-wireless-routers-exploited-to-attack-gaming-servers/.](https://unit42.paloaltonetworks.com/home-small-office-wireless-routers-exploited-to-attack-gaming-servers/)

[37] Leo Hyun Park, Jungbeen Yu, Hong-Koo Kang, Taejin Lee, and Taekyoung Kwon.
2020. Birds of a Feature: Intrafamily Clustering for Version Identification of
Packed Malware. IEEE Systems Journal (2020).

[38] Roberto Perdisci, Wenke Lee, and Nick Feamster. 2010. Behavioral Clustering
of HTTP-based Malware and Signature Generation Using Malicious Network
Traces. In NSDI.

[39] Roberto Perdisci and ManChon U. 2012. VAMO: Towards a Fully Automated
Malware Clustering Validity Analysis. In ACSAC.

[40] Erich Schubert, Jörg Sander, Martin Ester, Hans Peter Kriegel, and Xiaowei
Xu. 2017. DBSCAN Revisited, Revisited: Why and How You Should (Still) Use
DBSCAN. ACM Trans. Database Syst. 42, 3, Article 19 (July 2017), 21 pages.
[https://doi.org/10.1145/3068335](https://doi.org/10.1145/3068335)

[41] Marcos Sebastian, Richard Rivera, Platon Kotzias, and Juan Caballero. 2016. AVclass: A Tool for Massive Malware Labeling. In RAID.

[42] Symantec. 2018. Symantec Internet Security Threat Report (ISTR).
[https://www.symantec.com/content/dam/symantec/docs/reports/istr-23-](https://www.symantec.com/content/dam/symantec/docs/reports/istr-23-2018-en.pdf)
[2018-en.pdf.](https://www.symantec.com/content/dam/symantec/docs/reports/istr-23-2018-en.pdf)

[43] Symantec. 2019. Symantec Internet Security Threat Report (ISTR).
[https://www.symantec.com/content/dam/symantec/docs/reports/istr-24-](https://www.symantec.com/content/dam/symantec/docs/reports/istr-24-2019-en.pdf)
[2019-en.pdf.](https://www.symantec.com/content/dam/symantec/docs/reports/istr-24-2019-en.pdf)

[44] Talos. 2018. New VPNFilter malware targets at least 500K networking devices
[worldwide. https://blog.talosintelligence.com/2018/05/VPNFilter.html.](https://blog.talosintelligence.com/2018/05/VPNFilter.html)

[45] Pierre-Antoine Vervier and Yun Shen. 2018. Before Toasters Rise Up: A View
into the Emerging IoT Threat Landscape. In RAID.

[46] T. Yeh. [n.d.]. Netis Routers Leave Wide Open Backdoor. [https://blog.trendmicro.com/trendlabs-security-intelligence/netis-routers-leave-wide-open-backdoor/](https://blog.trendmicro.com/trendlabs-security-intelligence/netis-routers-leave-wide-open-backdoor/).

[Figure 7 plot: number of samples (y-axis, 0 to 1,600) against file size in KB (x-axis, logarithmic scale from 10^0 to 10^4), with separate series for statically linked and dynamically linked samples.]

**Figure 7: File size distribution of malware in the dataset.**

### A FEATURES-BASED CLUSTERING

In this section we describe our initial attempt at reconstructing
IoT malware lineage using a traditional feature-based clustering
approach. As explained in Section 3, we eventually adopted a different solution to reach our goal. However, as feature-based clustering
is often used in malware studies, we believe there is value in
reporting the results of this attempt and discussing the reasons behind
its failure.

### A.1 Foreword on Malware Clustering

Malware clustering has been extensively studied as a way to cope
with the increasing sophistication and the rapid growth in the
number of observed samples [5, 6, 23, 25, 29, 38]. As a result, there is
a long list of works, of which we summarize those we believe to be
the most relevant.

A large corpus of works focuses on behavior-based malware clustering [5, 6, 29, 38]; these works typically differ in the malware features they use, the clustering algorithm, and the size of the dataset. Bailey et al. [5]
created fingerprints from user-visible system state changes (e.g.,
files written, processes created) and then leveraged a single-linkage
hierarchical clustering algorithm to automatically classify approximately 3.7K samples. Bayer et al. [6] leveraged augmented malware
execution traces and then applied a single-linkage hierarchical
clustering algorithm on 14K samples. Perdisci et al. [38] produced
malware network signatures by using clustering to extract structural similarities in malicious HTTP traffic traces generated by 25K
samples. Kirat et al. [29] built a system that automatically generates
system call-based signatures for 3.1K evasive malware samples and
further grouped those samples using a complete-linkage clustering
algorithm.

Others have looked at static analysis-based malware clustering [23, 25]. Hu et al. [23] proposed MutantX-S, which exploits a hashing
trick to reduce the dimensionality of static features and leverages a prototype-based clustering algorithm to resolve the scalability issues faced
by previous malware clustering approaches. Similarly, Jang et
al. [25] proposed BitShred, which uses feature hashing to reduce the
high-dimensional feature spaces that are common in malware analysis.




Finally, Li et al. [30] discussed the challenges in evaluating malware clustering, especially when it comes to building an accurate
ground truth. Perdisci et al. [39] also proposed a machine learning-based system to build an AV label-based model against which third-party
clustering results can be evaluated.

Note that none of these works applied their techniques to
Linux malware, which brings many challenges related to the
variety of CPU architectures. Moreover, the size of our dataset (93K
samples) and the high number of (static and dynamic) features led us
to choose the FISHDBC [13] algorithm for our initial feature-based
clustering.

### A.2 Feature Extraction

To analyze each sample, we leverage a free ELF binary analysis
service[4] based on a recent work [12]. The service relies on a combination of static and dynamic analyses to comprehensively evaluate
ELF binaries. It provides runtime behavioral reports via its multiarchitecture sandboxing environment, from which we extract 146
features that belong to five groups. We refer the reader to Appendix B for the complete list of extracted features.

(1) ELF and byte-level features capture low-level characteristics of the binary, such as its architecture, whether it is
statically or dynamically linked, stripped or unstripped, the
number of ELF sections, its file size, the entropy of each
section and its most common bytes, etc.
(2) Binary disassembly features report numerical statistics
extracted with IDA Pro, such as the number of functions,
their complexity, the number of instructions, etc.
(3) Strings include printable strings extracted from the binary,
grouped into IP addresses, URLs, and UNIX paths.
(4) Runtime behavior covers the information extracted from
the execution of the binary in a sandbox, including whether
the sample was executed correctly, the list of issued system
calls, the different files opened, modified or deleted, whether
the binary has attempted to achieve persistence on the system, etc.
(5) Network traffic features provide a detailed breakdown of
all network connections observed while the binary was running, as extracted by the Zeek (formerly Bro) IDS, including
contacted IP addresses, files transferred, domain names resolved, etc.

### A.3 Clustering

Our dataset is large and very complex, containing 93K samples and
146 features, several of them categorical. We converted categorical features to numeric ones with the standard one-hot encoding
technique, whereby each categorical feature becomes a set of n
booleans indicating which of the n categories for that feature
the item belongs to. For categorical features, we ended up
with a sparse matrix having tens of thousands of columns: such
a large dimensionality is generally very problematic in terms of
scalability for generic clustering algorithms. To deal with it, we use
FISHDBC [13], a density-based clustering algorithm designed to
scale to complex datasets and arbitrary/non-metric distance
functions. FISHDBC approximates HDBSCAN* [11], an evolution of
the widely known DBSCAN algorithm [16, 40], generally without
compromising the quality of the results. Due to scalability issues
we could not run HDBSCAN* on our complete dataset, but we confirmed that the results of FISHDBC and HDBSCAN* were equivalent
on smaller datasets. The algorithm outputs hierarchical clustering
results in a top-down fashion, from the most coarse-grained to
the most fine-grained, and makes it possible to identify the level that yields
the best classification.

[4] Padawan: [https://padawan.s3.eurecom.fr](https://padawan.s3.eurecom.fr)

We consider numeric and categorical features for each group
separately; for categorical features we pre-process the dataset using
_tf-idf_ and use the cosine distance, while we use the Euclidean distance
for numerical features. To empirically assess the impact of feature
groups, we performed 25 rounds of clustering with different
combinations of feature groups, i.e., by including or discarding
some of the five categories.
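
As an illustration of the two distance functions mentioned above, the following minimal sketch weights categorical tokens with tf-idf and compares them with the cosine distance, while numerical features are compared with the Euclidean distance. It is not the pipeline used in the paper: the token format `feature=value` (equivalent to one one-hot column per value), the smoothed idf formula, and all function names are assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(token_sets):
    """Map each sample (a set of 'feature=value' tokens) to a sparse tf-idf vector.

    Assumption: term frequency is 1 for every token present (one-hot style) and
    idf is smoothed as log((1 + n) / (1 + df)) + 1 so that it is always positive.
    """
    n = len(token_sets)
    df = Counter(tok for toks in token_sets for tok in set(toks))
    return [
        {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in set(toks)}
        for toks in token_sets
    ]


def cosine_distance(u, v):
    """Cosine distance between two sparse vectors stored as dicts."""
    if not u and not v:
        return 0.0  # two empty profiles are treated as identical
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(a, b):
    """Euclidean distance between two equally long numeric feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical samples: categorical ELF features expressed as tokens.
samples = [
    {"elf.machine=ARM", "elf.link=static"},
    {"elf.machine=ARM", "elf.link=static"},
    {"elf.machine=MIPS", "elf.link=dynamic"},
]
vectors = tfidf_vectors(samples)
same = cosine_distance(vectors[0], vectors[1])       # close to 0: identical tokens
different = cosine_distance(vectors[0], vectors[2])  # 1.0: no shared tokens
```

Because FISHDBC accepts arbitrary distance functions, distances of this kind could in principle be passed to it directly, although the exact integration is outside the scope of this sketch.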
To get a rough estimate of the quality of the clustering we use
AV labels as a provisional ground truth. Even if some labels
may be wrong, we still expect samples in the same
cluster to largely come from the same family. Using the output of
AVclass, we flag each cluster as one of four categories: (i) Pure if
all of its samples have the same AV label, (ii) Single if it contains
a combination of samples with the same AV label and unlabelled
samples, (iii) Majority if more than 90% of the samples in the cluster
have the same AV label, and (iv) Mixed if it does not fit any of the
previous categories. Table 6 provides a summary of the results of
the 25 rounds of clustering. For the sake of conciseness, we only
provide the best results obtained per combination of feature groups
across all tested weights. Note that the clustering on the IDA Pro
features could only be performed on the restricted set of 4,960
dynamically linked samples, to avoid introducing noise in
the IDA Pro features due to the large amount of library
code embedded in statically linked binaries. Moreover, the table does not contain results for the network
features alone because network features were too sparse and could
not be used by themselves to build our hierarchical clusters.
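
The four cluster categories can be expressed as a small decision function. The sketch below is only an illustration of the definitions above: the function name, the use of `None` for unlabelled samples, the choice of the whole cluster as the denominator for the 90% rule, and the handling of clusters with no labelled sample are all assumptions.

```python
from collections import Counter


def flag_cluster(labels, majority_threshold=0.9):
    """Flag a cluster as 'pure', 'single', 'majority' or 'mixed'.

    labels: one family label per sample in the cluster, None if unlabelled.
    Assumption: the 90% rule is computed over all samples in the cluster.
    """
    labelled = [label for label in labels if label is not None]
    if not labelled:
        # Assumption: a cluster without any labelled sample fits none of the
        # first three definitions, so it falls into the last category.
        return "mixed"
    counts = Counter(labelled)
    _, top_count = counts.most_common(1)[0]
    if len(counts) == 1:
        # One family only: pure if every sample is labelled, single otherwise.
        return "pure" if len(labelled) == len(labels) else "single"
    if top_count / len(labels) > majority_threshold:
        return "majority"
    return "mixed"


# The (gafgyt: 33), (aidra: 2) example: 33 of 35 samples share one label.
example = flag_cluster(["gafgyt"] * 33 + ["aidra"] * 2)
```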
Table 6 shows that individual feature groups successfully identify several
groups of samples belonging to the same family (i.e., pure clusters),
but also cluster together many samples that have little or nothing in common (e.g., mixed clusters). The results do not improve
much by combining all features, as the limitations of each group
tend to increase the noise in the overall classification. Out of all
combinations we tried in our experiments, the ELF and bytes features alone produced the best clustering results, with a total of 44,491
samples in pure clusters and only 14,204 samples in mixed clusters.
However, even in this case roughly one third of our dataset was
placed in majority clusters, which erroneously contained samples
of different families.

**Table 6: Clustering results: static and dynamic features.**

| Feature groups | _pure_ (# samples) | _single_ (# samples) | _majority_ (# samples) | _mixed_ (# samples) |
|---|---|---|---|---|
| ✓ | **44,491** | **4,657** | 31,649 | 14,204 |
| ✓ | 3,677 | 45 | 316 | 1,082 |
| ✓ | 18,141 | 3,120 | 23,412 | 50,328 |
| ✓ | 27,889 | 1,097 | 5,726 | 60,289 |
| ✓ ✓ ✓ ✓ ✓ | 34,313 | 2,337 | 12,741 | 45,610 |
| ✓ ✓ ✓ ✓ | 38,825 | 3,062 | 24,234 | 27,531 |
| ✓ ✓ ✓ | 39,904 | 2,495 | 17,667 | 33,586 |
| ✓ ✓ | 42,427 | 2,587 | **34,118** | 14,520 |
| ✓ ✓ ✓ | 20,822 | 983 | 12,964 | 58,883 |

We then investigated the clusters
produced by the different feature group combinations. We
wanted to understand whether these clusters could be directly used
to group together samples that belong to the same variant or sub-family and, if so, what exactly changed
between one version and the next. We first looked at the _pure
clusters_. We noticed that all medium-to-large malware families
were broken down by our system into many pure clusters. If we consider the combination that produced the best clustering results, i.e.,
the ELF and bytes combination, 20,027 Gafgyt samples were clustered in 1,071 different pure clusters, and as many as 13,391 Mirai
samples populated a total of 654 pure clusters. At first glance, this would
make them good candidates for our sub-family investigation. As
expected, different clusters often captured different common
features of the samples. For example, they separated dynamically
from statically linked binaries, or samples that successfully executed in our VM from those that did not (and therefore resulted
in an empty dynamic behavior profile). However, our goal was not
to distinguish Mirai samples that were dynamically or statically
linked, but rather to identify the family's evolution over time. Unfortunately, the
resulting clusters did not isolate sub-families
but rather grouped samples that produced similar features (e.g., two samples
that immediately terminate with an error message are not necessarily similar, despite the common behavior). During the manual
investigation of the clustering results, we also noticed that the captured runtime and network behavior of different variants of the
same family, when not missing, were often identical or so similar
that the clustering algorithm could hardly differentiate them. For
example, most variants of Mirai follow the same high-level
process after the device is compromised: (i) reach out to the C&C
server, (ii) retrieve some target IP addresses to scan for worm-like
replication, (iii) launch scanning, (iv) receive DDoS attack target(s),
and (v) launch DDoS attack(s). This hinders the identification of
variants from such a trace. Additionally, considering finer-grained
features is likely to introduce overly specific clusters.
We also manually investigated those clusters that contained
samples with different AV labels. In particular, we looked at those
with a predominant number of samples sharing a consistent AV
label and a small number of samples with a different one (_majority
clusters_), e.g., (gafgyt: 33), (aidra: 2). While intuitively this
could have been the result of errors in AV classification, after
dozens of manual investigations we could not find a single mislabeled sample. Note that this does not mean there
were no errors in individual AV labels (we did find several of those),
but that, after applying the majority voting provided by AVclass, the
result (when a consensus was reached) was always correct. Errors
in the majority voting also existed, as explained in more detail in
Section 4.2, but we needed a more precise clustering to successfully
isolate them from the noise.
Traditional clustering based on static and dynamic features was
insufficient to identify meaningful similarities and isolate variations
among sub-families. In particular, when applied to a large dataset,
the number of errors largely exceeded our ability to manually investigate and correct the results. Dynamic features (for example
those extracted from runtime behavior or network traffic) failed
to accurately classify samples even into coarse-grained malware
families. On the other hand, we observed that static features (for
instance ELF features) produced very compact micro-clusters
sensitive to very fine-grained changes in the binary representation
of malware samples. While this was more successful at grouping together samples belonging to the same family, such over-sensitive
classification turned out to be inappropriate for the identification of
IoT malware variants. This contrasts with previous clustering and
lineage works, e.g., on Windows [37], where malware programs
exhibit more unique behaviors than the IoT counterparts
seen to date.

### B LIST OF STATIC AND DYNAMIC FEATURES


**Feature name: Description**

**bytes.common_bytes: List of the three most common bytes (with counter)**
**bytes.entropy: The entropy of the binary**
**bytes.header: First 16 bytes of the file**
**bytes.footer: Last 16 bytes of the file**
**bytes.longest_sequence.length: Longest sequence of the same byte (byte, offset, length)**
**bytes.min_entropy: Lowest entropy among 16K bytes blocks**
**bytes.max_entropy: Highest entropy among 16K bytes blocks**
**bytes.null_bytes: Number of null (0) bytes**
**bytes.printable: Number of printable bytes**
**bytes.rarest_bytes: List of the three rarest bytes (with counter)**
**bytes.unique_bytes: Number of unique bytes (0-255)**
**bytes.white_spaces: Number of white-spaces (0x20,\n,\r,\t) bytes**
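
To make the byte-level features above concrete, the following sketch computes a subset of them from a raw byte string. It is an illustrative reimplementation rather than the code of the analysis service; the function names and dictionary keys are assumptions, and the block size is taken to be 16 × 1024 bytes.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    return sum((c / total) * math.log2(total / c) for c in Counter(data).values())


def byte_features(data: bytes, block_size: int = 16 * 1024) -> dict:
    """Compute a few of the bytes.* features for one file's content."""
    counts = Counter(data)
    total = len(data)

    # Longest run of the same byte, reported as (byte, offset, length).
    longest = (None, 0, 0)
    run_start = 0
    for i in range(1, total + 1):
        if i == total or data[i] != data[run_start]:
            if i - run_start > longest[2]:
                longest = (data[run_start], run_start, i - run_start)
            run_start = i

    # Entropy of each fixed-size block; an empty file yields a single empty block.
    blocks = [data[i:i + block_size] for i in range(0, total, block_size)] or [data]
    block_entropies = [shannon_entropy(block) for block in blocks]

    return {
        "entropy": shannon_entropy(data),
        "min_entropy": min(block_entropies),
        "max_entropy": max(block_entropies),
        "common_bytes": counts.most_common(3),
        "null_bytes": counts.get(0, 0),
        "unique_bytes": len(counts),
        "white_spaces": sum(counts.get(b, 0) for b in b" \n\r\t"),
        "longest_sequence": longest,
    }


# Hypothetical input: three null bytes, two letters, a space and a newline.
features = byte_features(b"\x00\x00\x00AB \n")
```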


**elf.anomalies.ehph_diff: Difference between segment virtual address and file offset**
**elf.anomalies.entrypoint.permission: Anomalous entrypoint: Permission**
**elf.anomalies.entrypoint.section: Anomalous entrypoint: Section**
**elf.anomalies.entrypoint.segment: Anomalous entrypoint: Segment**
**elf.anomalies.sections.cpp_prelink: Anomalous sections: C++ prelink section**
**elf.anomalies.sections.grub_module: Anomalous sections: Grub module**
**elf.anomalies.sections.headers: Anomalous sections: Wrong number of section headers**
**elf.anomalies.sections.high_entropy: Anomalous sections: High entropy**
**elf.anomalies.sections.kernel_object: Anomalous sections: Kernel object**
**elf.anomalies.sections.section_header_null: Anomalous sections: Null section headers**
**elf.anomalies.sections.shentsize_empty: Size of section header table’s entry null**
**elf.anomalies.sections.shnum_empty: Anomalous sections: Number of section headers empty**
**elf.anomalies.sections.shnum_pastfile: Anomalous sections: Section header table beyond file**
**elf.anomalies.sections.shoff_empty: Anomalous sections: Section header table offset empty**
**elf.anomalies.sections.shoff_pastfile: Anom. sec.: Section header table offset beyond file**
**elf.anomalies.sections.uncommon: Anomalous sections: Uncommon sections**
**elf.anomalies.sections.wrong_shstrndx: Anom. sec.: Wrong section name string table index**
**elf.anomalies.segments.error: Error in segments table**
**elf.anomalies.segments.headers: Anomalous segments: Wrong number of program headers**
**elf.anomalies.segments.high_entropy: Anomalous segments: High entropy**
**elf.anomalies.segments.high_mem: Segment memory size much bigger than physical size**
**elf.anomalies.segments.wx: Anomalous segments: W&X permission**
**elf.class: ELF file’s class**
**elf.comment: .comment section of the ELF, if present**
**elf.data: Data encoding of the processor-specific data**
**elf.debug: If the binary contains debug information (compiled with -g)**
**elf.dynfuncs: Dynamic symbols being used, of type FUNC in particular**
**elf.entrypoint: Binary entrypoint**
**elf.e_phentsize: Size in bytes of one entry in the program header table**
**elf.e_phnum: Number of entries in the program header table**
**elf.e_phoff: Program header table’s file offset in bytes**
**elf.e_shentsize: Size in bytes of one entry in the section header table**
**elf.e_shnum: Number of entries in the section header table**
**elf.e_shoff: Section header table’s file offset in bytes**
**elf.e_shstrndx: Index of section header table containing section names**
**elf.gdb: Error raised by gdb**
**elf.interpreter: ELF’s declared interpreter**
**elf.link: Statically or dynamically linked**
**elf.machine: Required architecture for the file**
**elf.malformed.entrypoint: Malformed ELF: Wrong entrypoint**
**elf.malformed.pastload: Malformed ELF: Beyond LOAD segment**
**elf.malformed.pastphnum: Malformed ELF: Beyond program header table**
**elf.malformed.pastsegment: Malformed ELF: Beyond segment**
**elf.needed: DT_NEEDED entries for dynamic ELF files**
**elf.note: .note.* sections of the ELF, if present**
**elf.nsections: Number of sections**
**elf.nsegments: Number of segments**
**elf.osabi: Operating system/ABI identification**




**elf.pyelftools: Exception raised by pyelftools, if any**
**elf.readelf: Error raised by readelf**
**elf.soname: DT_SONAME entry for dynamic ELF files**
**elf.stripped: Whether the binary has been stripped or not**
**elf.stripped_sections: Whether the sections table of the binary has been stripped or not**
**elf.type: Object file type**
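
Several of the `elf.*` features listed above come straight from the ELF file header. The sketch below shows how they could be read with Python's `struct` module; it is not the extraction code used by the analysis service (which, judging by the feature names, also relies on tools such as pyelftools and readelf), and the function name, output keys and the short machine table are assumptions.

```python
import struct

# Small, non-exhaustive table of e_machine values (assumption: enough for a demo).
MACHINES = {3: "x86", 8: "MIPS", 20: "PowerPC", 40: "ARM", 62: "x86-64", 183: "AArch64"}
TYPES = {1: "REL", 2: "EXEC", 3: "DYN", 4: "CORE"}


def parse_elf_header(data: bytes) -> dict:
    """Extract header-level features from the first bytes of an ELF file."""
    if len(data) < 16 or data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class, ei_data, ei_osabi = data[4], data[5], data[7]
    if ei_class not in (1, 2) or ei_data not in (1, 2):
        raise ValueError("invalid EI_CLASS or EI_DATA")
    endian = "<" if ei_data == 1 else ">"
    # Fields after e_ident: e_type, e_machine, e_version, e_entry, e_phoff,
    # e_shoff, e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum,
    # e_shstrndx. Addresses and offsets are 4 bytes in ELF32 and 8 in ELF64.
    fmt = endian + ("HHIIIIIHHHHHH" if ei_class == 1 else "HHIQQQIHHHHHH")
    if len(data) < 16 + struct.calcsize(fmt):
        raise ValueError("truncated ELF header")
    (e_type, e_machine, _version, e_entry, e_phoff, e_shoff, _flags, _ehsize,
     e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx) = struct.unpack_from(fmt, data, 16)
    return {
        "elf.class": "ELF32" if ei_class == 1 else "ELF64",
        "elf.data": "little-endian" if ei_data == 1 else "big-endian",
        "elf.osabi": ei_osabi,
        "elf.type": TYPES.get(e_type, e_type),
        "elf.machine": MACHINES.get(e_machine, e_machine),
        "elf.entrypoint": e_entry,
        "elf.e_phoff": e_phoff,
        "elf.e_shoff": e_shoff,
        "elf.e_phentsize": e_phentsize,
        "elf.e_phnum": e_phnum,
        "elf.e_shentsize": e_shentsize,
        "elf.e_shnum": e_shnum,
        "elf.e_shstrndx": e_shstrndx,
    }


# Synthetic 32-bit little-endian ARM header with no section header table,
# as is common for binaries whose section table has been stripped.
header = b"\x7fELF" + bytes([1, 1, 1, 0]) + bytes(8) + struct.pack(
    "<HHIIIIIHHHHHH", 2, 40, 1, 0x8000, 52, 0, 0, 52, 32, 3, 40, 0, 0)
info = parse_elf_header(header)
```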

**strings.ip: Potential IPs (v4 and v6) found in the binary**
**strings.path: Potential UNIX paths found in the binary**
**strings.url: Potential URLs found in the binary**

**idapro.average_bytes_func: Average size in bytes of a function**
**idapro.avg_basic_blocks: Average number of basic blocks per function**
**idapro.avg_cyclomatic_complexity: Average cyclomatic complexity per function**
**idapro.avg_loc: Average lines of code per function**
**idapro.branch_instr: Number of branch instructions**
**idapro.bytes_func: Total size in bytes of the functions**
**idapro.call_instr: Number of call instructions**
**idapro.func_loc: Percentage of instructions belonging to functions**
**idapro.indirect_branch_instr: Number of indirect branch instructions**
**idapro.loc: Explored lines of code**
**idapro.max_basic_blocks: Max basic blocks**
**idapro.max_cyclomatic_complexity: Max cyclomatic complexity**
**idapro.nfuncs: Number of functions detected**
**idapro.percent_load_covered: Percentage of covered load segment**
**idapro.percent_text_covered: Percentage of covered text section**
**idapro.syscall_instr: Number of syscall instructions**

**behavior.user.argv0_rename: Procs renaming argv0**
**behavior.user.askroot: Whether the execution got permission related errors**
**behavior.user.checkgid: If gid is checked**
**behavior.user.checkuid: If uid is checked**
**behavior.user.cmds: System cmds**
**behavior.user.compare: strcmp or memcmp comparison**
**behavior.user.cve: Possible CVEs exploited**
**behavior.user.dropped.create: Dropped files: Create**
**behavior.user.dropped.link: Dropped files: Link**
**behavior.user.dropped.linkfrom: Dropped files: Link from**
**behavior.user.dropped.modify: Dropped files: Modify**
**behavior.user.empty: Empty or no trace**
**behavior.user.errors.enosys: Errors from execution: Syscall not implemented**
**behavior.user.errors.execfault: Errors from execution: Execution fault**
**behavior.user.errors.illegal: Errors from execution: Illegal instruction**
**behavior.user.errors.missinglibs: Errors from execution: Missing library**
**behavior.user.errors.segfault: Errors from execution: Segmentation fault**


**behavior.user.errors.sigbus: Errors from execution: Bus error**
**behavior.user.errors.wronginterp: Errors from execution: Wrong interpreter**
**behavior.user.ioctl.fail: Ioctls: Fail**
**behavior.user.ioctl.success: Ioctls: Success**
**behavior.user.ioctl.total_no: Ioctls: Total number**
**behavior.user.libccalls.total_no: Libc calls from execution: Total number**
**behavior.user.libccalls.unique: Libc calls from execution: Unique**
**behavior.user.libccalls.unique_no: Libc calls from execution: Unique number**
**behavior.user.lineslost: Amount of trace lines not correctly parsed**
**behavior.user.persistence.create: Sample persistence: Create**
**behavior.user.persistence.link: Sample persistence: Link**
**behavior.user.persistence.linkfrom: Sample persistence: Link from**
**behavior.user.persistence.modify: Sample persistence: Modify**
**behavior.user.proc_rename: Procs renaming**
**behavior.user.procs: Number of processes spawned**
**behavior.user.ptrace_request: Ptrace requests**
**behavior.user.read_only: Files being read**
**behavior.user.rooterr.EACCES: EACCES type of permission related error**
**behavior.user.rooterr.EPERM: EPERM type of permission related error**
**behavior.user.sleep_max: Max sleep**
**behavior.user.syscalls.total_no: Syscalls from execution: Total number**
**behavior.user.syscalls.unique: Syscalls from execution: Unique**
**behavior.user.syscalls.unique_no: Syscalls from execution: Unique number**
**behavior.user.unlink: Unlink files**
**behavior.user.unlink_itself: Unlink itself**

**dynamic.error: Errors encountered during sandboxing**
**dynamic.stderr: Standard error during analysis**
**dynamic.stdout: Standard output during analysis**

**nettraffic.conn.avg_duration: Average duration of connections**
**nettraffic.conn.bytes: Number of bytes exchanged**
**nettraffic.conn.conns: Number of connections**
**nettraffic.conn.ips: List of unique IP addresses contacted**
**nettraffic.conn.pkts: Number of packets exchanged**
**nettraffic.conn.ports: List of unique destination ports**
**nettraffic.dns.qry_resp: List of unique DNS queries and their responses**
**nettraffic.dns.queried_domains: List of unique domains resolved through DNS**
**nettraffic.files.dropped_files_hash: List of unique hashes (SHA-256) of dropped files**
**nettraffic.files.dropped_files_mimetype: List of unique MIME types of dropped files**
**nettraffic.files.dropped_files_source_ips: List of unique IP addresses from which dropped files have been downloaded**
**nettraffic.files.dropping_protos: List of unique protocols used to drop files**
**nettraffic.ssl.ssl_domains: List of unique domains contacted over SSL/TLS**

