# Predicting Domain Generation Algorithms with Long Short-Term Memory Networks

### Jonathan Woodbridge, Hyrum S. Anderson, Anjum Ahuja, and Daniel Grant


### {jwoodbridge,hyrum,aahuja,dgrant}@endgame.com Endgame, Inc. Arlington, VA 22201


**_Abstract—Various families of malware use domain generation_**
**algorithms (DGAs) to generate a large number of pseudo-random**
**domain names to connect to a command and control (C2) server.**
**In order to block DGA C2 traffic, security organizations must**
**first discover the algorithm by reverse engineering malware**
**samples, then generate a list of domains for a given seed. The**
**domains are then either preregistered, sink-holed or published**
**in a DNS blacklist. This process is not only tedious, but can**
**be readily circumvented by malware authors. An alternative**
**approach to stop malware from using DGAs is to intercept DNS**
**queries on a network and predict whether domains are DGA**
**generated. Much of the previous work in DGA detection is based**
**on finding groupings of like domains and using their statistical**
**properties to determine if they are DGA generated. However,**
**these techniques are run over large time windows and cannot be**
**used for real-time detection and prevention. In addition, many of**
**these techniques also use contextual information such as passive**
**DNS and aggregations of all NXDomains throughout a network.**
**Such requirements are not only costly to integrate, they may not**
**be possible due to real-world constraints of many systems (such**
**as endpoint detection). An alternative to these systems is a much**
**harder problem: detect DGA generation on a per domain basis**
**with no information except for the domain name. Previous work**
**to solve this harder problem exhibits poor performance and many**
**of these systems rely heavily on manual creation of features;**
**a time consuming process that can easily be circumvented by**
**malware authors. This paper presents a DGA classifier that**
**leverages long short-term memory (LSTM) networks for real-time**
**prediction of DGAs without the need for contextual information**
**or manually created features. In addition, the presented technique**
**can accurately perform multiclass classification giving the ability**
**to attribute a DGA generated domain to a specific malware family.**
**The technique is extremely easy to implement using open source**
**tools allowing the technique to be deployed in almost any setting.**
**Results are significantly better than all state-of-the-art techniques,**
**providing 0.9993 area under the receiver operating characteristic**
**curve for binary classification and a micro-averaged F1 score of**
**0.9906. In other terms, the LSTM technique can provide a 90%**
**detection rate with a 1:10000 false positive (FP) rate—a twenty**
**times FP improvement over the next best method. Experiments**
**in this paper are run on open datasets and code snippets are**
**provided to reproduce the results.**

I. INTRODUCTION


Many malware families contain domain generation algorithms (DGAs) to make preemptive defenses difficult. Domains
are generated pseudo-randomly in bulk (hundreds to tens-ofthousands per day) by a malware sample. The malware then
attempts to connect to all or a portion of these generated
domains in hopes of finding a command and control (C2)


server from which it can update, upload gathered intelligence,
or pursue other malicious activities. The malicious actor only
needs to register a small number of these domains to be
successful. However, all the domains must be sinkholed,
registered, or blacklisted before they go into use in order
to preemptively defeat such an attack. This defense becomes
increasingly difficult as the rate of dynamically generated
domains increases.

Authors in [1] presented a thorough review of the efficacy
of blacklists. As a part of this review, authors analyzed
both public and private blacklists for DGA coverage, (i.e.,
how many domains generated by DGAs were contained in
blacklists). Public blacklists were surprisingly lacking in terms
of DGA coverage with less than 1.2% of DGAs analyzed by
the authors being contained in any of the blacklists. Vendor
provided blacklists fared better, but had mixed results over
malware families with coverage varying from 0% to 99.5%.
These results suggest that blacklists are useful, but must be
supplemented by other techniques to provide a more adequate
level of protection.

Another approach to combating malware using DGAs is to
build a DGA classifier. This classifier can live in the network
sniffing out DNS requests and looking for DGAs. When DGAs
are detected, the classifier notifies other automated tools or
network administrators to further investigate the origin of
a DGA. Previous work in DGA detection can be broken
down into two categories: retrospective detection and real-time
detection. Retrospective detection makes bulk predictions on
large sets of domains and are designed as a reactionary system
that cannot be used for real-time detection and prevention [2],

[3], [4]. In these systems, sets of domains are broken down into
groupings using clustering with the intent to generate statistical
properties of each grouping. Classification is accomplished
by generating templates during training and using statistical
tests (e.g., Kullback-Leibler divergence) to classify groups
of potential DGAs. In addition, these techniques incorporate
contextual information such as HTTP headers, NXDomains
across a network, and passive DNS to further improve performance. Much of the previous work in DGA detection falls
in the former category and, unfortunately, does not meet the
needs of many real-world security applications that require
real-time detection and prevention [5]. In addition, it is often
unrealistic for many security applications to use contextual
information. For example, endpoint detection and response
(EDR) systems run on endpoints and hosts and have strict
performance requirements on processing, network, and mem

-----

ory usage. Aggregating such contextual information from the
network to each endpoint requires far too much overhead and
is not practical for a real-world deployment.

Real-time detection techniques attempts to classify domains as DGA generated on a per domain basis using only the
domains’ names (i.e., no additional contextual information).
Real-time detection is a considerably harder problem than
retrospective techniques and techniques often exhibit performance far too low for a real-world deployment. (Suprisingly,
authors in [5] found that retropsective techniques had similarly
bad performance!) Many of the previous real-time approaches
use hand picked features (e.g., entropy, string length, vowel
to consonant ratio, etc.) that are fed into a machine learning
model, such as a random forest classifier. Using hand-crafted
features have two major drawbacks. First, hand-crafted features
are easy to circumvent. Second, deriving hand-crafted features
is a time consuming process. If, and when, a malicious actor
derives a new DGA family around beating a set of features,
security professionals will need to spend considerable time
creating new features. To the best of our knowledge, authors
in [2] presented the first (and only until this paper) featureless
real-time technique by using Hidden Markov Models (HMMs).
However, as shown later in the paper, HMMs perform quite
poorly on detecting DGAs. To note, the HMMs in [2] were
part of a much larger retrospective detection system.

This paper presents a feature-less real-time technique using Long Short-Term Memory networks (LSTMs) to classify
DGAs. This technique has four significant advantages over
other techniques in the literature. First, the LSTM DGA
classifier is featureless, in that it operates on raw domain names
(e.g., google.com, facebook.com, etc.). If a new family of DGA
appears, then the classifier can be retrained without the tedious
step of hand picking features. LSTMs work largely as a black
box making it very difficult for adversaries to reverse engineer
and beat a classifier without the same training set. Second,
the presented technique has a significantly better true positive
rate/false positive rate over previously published retrospective
and real-time approaches. Third, the technique also works in
a multiclass classification setting. Therefore, the algorithm not
only provides a binary decision of whether a domain is DGA
or not, but can accurately fingerprint a unique DGA’s structure. Fourth, the presented algorithm can classify in real-time
using absolutely no contextual information. Classification of a
domain takes 20 ms on commodity hardware.[1] The technique
is trivial to implement and can run on virtually any security
environment. In fact, all the code required to implement this
system is provided in this paper demonstrating its ease of
deployment.

In this paper, we make the following contributions. We

1) introduce an LSTM network to predict DGA generated domains, which to our knowledge, is the first
application and in-depth analysis of deep learning to
this domain;
2) present complete experimental results showing significant improvements over previous techniques (both
real-time and retrospective) in the literature using
open datasets; and
3) provide source code to reproduce results.

1Apple MacBook Pro with a 2.2 GHz Intel Core i7 and 16GB of memory


To allow for easily reproducible results, Python source code
built on the open source framework Keras [6] is provided.
Experiments were run on GPU hardware, but it’s possible to
run all experiments on commodity desktop or laptop hardware.
An overview of LSTMs and previous work is discussed in
Section II. Details of reproducing the results are given in
Sections III and IV. Full results are given in Section V with
suggestions for future work in Section VI.

II. BACKGROUND

Domain fluxing is a technique used by botnets and
command-and-control (C2) servers to create many domains
using a Domain Generation Algorithm (DGA) [7], [8]. All
botnets and C2 servers in the same infrastructure use the
same seeded algorithm such that they all create the same
pseudorandomly generated domains. A subset of these domains
are registered by the C2 servers while each botnet iterates
through the DGA generated domains until it finds one that
is registered. To further complicate the process, C2 servers
continually switch to new DGA generated domains making
blacklist creation and take down efforts difficult.

One approach to combating domain fluxing is to reverse
engineer a piece of malware and its respective DGA [8]. Once
a DGA and its respective seed is known, future domains can be
registered and used as an impostor C2 server to hijack botnets
(a process known as sinkholing). Once a campaign has been
hijacked, adversaries must redeploy new botnets with updated
seeds to continue.

Blacklisting is another approach to combat domain fluxing

[1]. DGA generated domains are added to a blacklist that can
be used by a network administrator to block connections to
potential C2 servers. However, both blacklists and sinkholing
are only effective when both the algorithm and seed used by
a campaign is known.

_A. Domain Generation Algorithms_

This paper evaluates the ability to classify DGA generated
domains from 30 different types of malware. Malware families
include ransomware, such as Cryptolocker [9], [10] and
Cryptowall [11], banking trojans, such as Hesperbot

[12], and general information-stealing tactics, such as ramnit

[13].

DGA techniques vary in complexity from simple uniformly
generated domain names to those that attempt to model distributions that are seen in real domains. ramnit, for example,
creates domains with a series of divides, multiplies and modulos computed on a seed [13] while suppobox creates domains
by concatenating two random strings (typically taken from the
English language) [14].

Predicting DGA generated domains from such algorithms
as suppobox is extremely difficult without using contextual
information. In fact, the LSTM technique presented in this
paper was the only real-time technique able to classify such
domains.

_B. DGA Classification_

DGA classification can be a useful component of a domain
reputation system. Domain reputation systems have the task of


-----

assigning a trustworthy score of a domain. This score typically
varies from 0 (most benign) to 1 (most malicious). Domain
reputation systems typically incorporate many pieces of heterogeneous data, such as passive DNS (pDNS), to make decisions
on a domain’s reputation [15], [16], [17]. DGA classification
is one piece of information that can help assign a reputation
to a domain. Previous approaches to DGA classification can
be roughly broken down into two categories:

1) _Retrospective: classifying domains in groups to take_
advantage of bulk statistical properties or common
contextual information; and
2) _Real-time: classifying domains individually with no_
additional contextual information.

Authors in [3], [4] detect DGAs by using both unigram
and bigram statistics of domain clusters. The training set is
separated into two subsets: those generated by a DGA and
those not generated by a DGA. The distributions of both
unigrams and bigrams are calculated for both the subsets. Classification occurs in batches. Each batch of unknown domains is
clustered by shared second level domain and domains sharing
the same IP address. The unigram and bigram distributions are
calculated for each cluster and compared to the two known
(labeled) subsets using the Kullback-Leibler (KL) distance.
In addition, the authors use the Jaccard distance to compare
bigrams between clusters and the known (labeled) sets as well.

Authors in [2] apply a similar clustering process to classify domains with unsuccessful DNS resolutions. To train,
statistical features are calculated for each subset of labeled
DGA generated domains, such as Bobax, Torpig, and
Conficker.C. Unknown domains are clustered by statistical
characteristics such as length, entropy, and character frequency
distribution, as well as shared hosts requesting the domain
(i.e., cluster two domains together if the same host made a
DNS query for both domains). Next, statistical features are
calculated for each cluster and compared to the training subsets
to classify the clusters as formed by a known DGA. If a cluster
is classified as belonging to a known DGA, the host is deemed
to be infected.

Once a host is deemed to be infected with a DGA-bot,
the authors attempt to identify the bots active C2 server. This
stage of the process uses a Hidden Markov Model trained on
each known family of DGA and applied to single domains
(i.e., this technique follows the same assumptions as the
LSTM technique proposed by this paper). Each domain with a
successful DNS request is fed through each HMM. If a domain
receives an adequate score (i.e., greater than some threshold
_θ), the domain is labeled as a DGA. The threshold is learned_
at training time and set to a maximum false positive rate of
1%. We use this HMM technique as one of our comparisons
to previous work.

The aforementioned techniques (with exception to the
HMM technique in [2]) are accomplished retrospectively.
Authors in [5] perform an in-depth comparison of these techniques and discuss two important findings. First, retrospective
techniques are too slow for most real-world deployments and
often take hours to detect malicious domains. Second, the
performance of these systems are quite poor in terms of
false positives and true positives. These authors present their
own technique that overlaps both retrospective and real-time


techniques. They apply an online form of sequential hypothesis
testing to NXDomains only. Clients in a network are given an
evolving score based on the number and maliciousness of NXDomains. A client can be labeled as malicious or benign once
its score goes above or below predefined thresholds. While
this system is a big improvement over retrospective systems,
it has three main drawbacks. First, detection is not always in
real-time as a client takes time to build an appropriate score.
Authors reported that only 83% of domains were detected in
time to prevent a connection. Second, performance of their
system is considerably less than most real-time solutions as
we show in section V. Third, their system cannot perform
multiclass classification as their system bases classification
solely on the presence of NXDomains.

Authors in [18] present a real-time DGA classifier that
uses two basic linguistic features named meaningful characters
_ratio and n-gram normality score. The meaningful characters_
_ratio calculates the ratio of characters in a domain that_
comprise of a meaningful word. For example, facebook has
a ratio of 1 as all character in the domain are covered by
the words face and book while face1234 has a ratio of 0.5
as only half of its character are covered by the word face.
The n-gram normality score is calculated by finding n-grams
with n ∈ 1, 2, 3 within a domain and calculating their count in
the English language. The mean and covariance of these four
features are calculated from a benign set (Alexa top 100,000).
Unknown domains are then classified by their Mahalanobis
distance to the benign set (i.e. a larger distance is indicative
of a DGA generated domain).

The approach in [18] is used as a filter step. Once domains
have been classified as a DGA they are fed to a clustering
technique (similar to those described above) to further classify
the domains.

Section V shows a comparison of our technique to both retrospective and real-time systems. Our technique significantly
outperforms retrospective techniques and the comparison is
brief and compares findings to those in [5]. An in depth
comparison is performed between our technique and the aforementioned real-time systems. More specififcally, we compare
our technique to the HMM defined by [2] as well as a Random
Forest Classifier trained on features defined in [2], [3], [4],

[18]. We do not perform an in depth comparison on the full
systems as defined in [2], [3], [4] as they are retrospective
systems and have already been shown to perform far worse
than our system [5].

_C. LSTM Networks_

In a variety of natural language tasks, recurrent neural networks (RNNs) have been used to capture meaningful temporal
relationships among tokens in a sequence [19], [20], [21], [22].
The key benefit of RNNs is that they incorporate contextual
(state) information in their mapping from input to output.
That is, the output of a single RNN cell is a function of the
input layer and previous RNN activations. Due to long chains
of operations that are introduced by including self-recurrent
connections, the output of a traditional RNN may decay
exponentially (or, more rarely but catastrophically explode) for
a given input, leading to the well-known vanishing gradients
problem. This makes learning long-term dependencies in an
RNN difficult to achieve.


-----

The problem of vanishing gradients is a key motivation behind the application of the Long Short-Term Memory (LSTM)
cell [23], [24], [25], which consists of a state that can be
read, written or reset via a set of programmable gates. The
cell’s state has a self-recurrent connection that allows the
cell to exactly retain state between time steps. However, that
state may be modulated by a new input via an input gate,
which effectively multiplies the input by a number that ranges
between 0 and 1 (sigmoid activation) or -1 and 1 (tanh
activation). Likewise, a forget gate modulates the self-recurrent
state connection by a number between 0 and 1. Thus, if the
input gate modulates the input with 0, and the forget gate
modulates the recurrent connection with 1, the cell ignores the
input and perfectly retains state. On the other hand, a 1 (input)
and a 0 (forget) causes the cell’s state to be overwritten by the
input. And in the case of a 0 (input) and 0 (forget), the state is
reset to 0. Finally, an output gate modulates the contribution
of the cell’s state to the output, which propagates to the input
gates of LSTM cells across the layer, as well as to subsequent
layers of the network.

The LSTM cell’s design with multiplicative gates allows a
network to store and access state over long sequences, thereby
mitigating the vanishing gradients problem. For our use with
domain names, the state space is intended to capture combinations of letters that are important to discriminating DGA
domains from non-DGA domains. This flexible architecture
generalizes manual feature extraction via bigrams, for example,
but instead learns dependencies of one or multiple characters,
whether in succession or with arbitrary separation.

III. METHOD

We employ an LSTM network for detecting DGAs. The
model has the following advantages:

_•_ the model accepts variable-length character sequences
as input, so that there is no auxiliary requirement for
feature extraction[2];

_•_ the model is very compact, comprised simply of an
embedding layer, an LSTM network layer, and a fully
connected output layer that is simple logistic (or for
multiclass, multinomial logistic) regression; and

_•_ although training on a large dataset is computationally
intensive, the shallow structure allows for very fast
query times.

A graphical depiction of our model is shown in Fig. 1.
To prevent overfitting when training neural networks, it is
common practice to employ dropout. Dropout consists of
randomly removing a random subset of edges between layers
of a network during each iteration of training, but restoring
their contribution at test time. We apply dropout after the
LSTM layer prior to logistic regression.

The embedding layer projects ℓ-length sequences of input
characters from the input domain S ⊂Z _[ℓ]_ to a sequence of
vectors R[d][×][ℓ], where ℓ is an upper bounded length determined
from the training set. The input domain consists of nonredundant valid domain name characters (lowercase alphanumeric, period, dash and underscore), and the output dimension

2In experiments, we employ a trivial pre-processing step to remove top-level
domains and convert all characters to lowercase.


input sequence

embedding layer

LSTM layer

logistic regression

output probability

Fig. 1: Our model consists of an embedding layer, an LSTM
layer that serves essentially as a feature extractor, and a logistic
regression classifier.

_d is a tunable parameter that represents an embedding. In our_
model, we choose d = 128 > |S| to provide additional degrees
of freedom to the model, but preliminary experiments showed
that results are relatively insensitive to the particular choice of
_d._

The LSTM layer can be thought of as implicit feature
extraction, as opposed to explicit feature extraction (e.g., ngrams) used in other approaches. Rather than represent domain
names explicitly as a bag of bigrams, for example, the LSTM
learns patterns of characters (or in our case, embedded vectors)
that maximize the performance of the second classification
layer. In our experiments we compare the LSTM model to
an explicit bigram logistic regression model.

All LSTM code was written in Python using the Keras
framework [6]. Two models are generated: one for a binary
classification and one for a multiclass classification. Code for
the binary classification is shown in Fig. 2 and the multiclass
classification in Fig. 3.

The two code examples have a few small differences. The
final dense layer goes from an output of one value in the binary
classifier (line 15) to nb_classes in the multiclass classifier
(line 17). A binary decision only requires a single value from

[0, 1] where 0 is the most benign and 1 is the most DGA.
The multiclass model produces nb_classes scores, one for
each family known by the classifier, where multinomial logistic
regression is employed on softmaxed activations on line 18 to
encode a distribution that sums to unity.

IV. EXPERIMENTAL SETUP

In the following section, we describe details of our experimental setup in evaluating DGA classifiers in a binary
experiment (DGA vs. non-DGA) and multiclass experiment
(which DGA?) using publically available domain names and
DGA data.

_A. Evaluation Metrics_

Precision, Recall, F1 score, and Receiver Operating Characteristic (ROC) are the four evaluation metrics used to compare the LSTM classification technique to other state-of-the-art
techniques. Precision is defined as


-----

1 **from keras.preprocessing import pad_sequences** 1 **from keras.preprocessing import pad_sequences**
2 **from keras.models import Sequential** 2 **from keras.models import Sequential**
3 **from keras.layers.core import Dense** 3 **from keras.layers.core import Dense**
4 **from keras.layers.core import Dropout** 4 **from keras.layers.core import Dropout**
5 **from keras.layers.core import Activation** 5 **from keras.layers.core import Activation**
6 **from keras.layers.embeddings import Embedding** 6 **from keras.layers.embeddings import Embedding**
7 **from keras.layers.recurrent import LSTM** 7 **from keras.layers.recurrent import LSTM**
8 8
9 model=Sequential() 9 model=Sequential()
10 model.add(Embedding(max_features, 10 model.add(Embedding(max_features,
11 128, 11 128,
12 input_length=75)) 12 input_length=75))
13 model.add(LSTM(128)) 13 model.add(LSTM(128))
14 model.add(Dropout(0.5)) 14 model.add(Dropout(0.5))
15 model.add(Dense(1)) 15 _# nb_classes is the number of classes in_

16 _# the training set_

16 model.add(Activation(’sigmoid’))

17 model.add(Dense(nb_classes))

17

18 model.add(Activation(’softmax’))

18 model.compile(loss=’binary_crossentropy’,

19

19 optimizer=’rmsprop’)

20 model.compile(loss=’categorical_crossentropy’,

20

21 optimizer=’rmsprop’)

21 _# Pad sequence where sequences are case_

22

22 _# insensitive characters encoded to_

23 _# Pad sequence where sequences are case_

23 _# integers from 0 to number of valid_

24 _# insensitive characters encoded to_

24 _# characters_

25 _# integers from 0 to number of valid_

25 X_train=sequence.pad_sequences(X_train,

26 _# characters_

26 maxlen=75)

27 X_train=sequence.pad_sequences(X_train,

27

28 maxlen=75)

28 _# Train where y_train is 0-1_

29

29 model.fit(X_train, y_train,

30 _# Train where y_train is one-hot encoded for_

30 batch_size=batch_size, nb_epoch=1) 31 _# each class_

32 model.fit(X_train, y_train,

Fig. 2: Binary LSTM Code

33 batch_size=batch_size, nb_epoch=1)

Fig. 3: Multiclass LSTM Code

� True Positive
Precision = _,_
� True Positive + � False Positive

and measures the purity of all positively labeled instances (i.e., � False Positive

FPR = _._

the ratio of correct positively labeled instances to all positively
labeled instances). Recall is defined as � False Positive + � True Negative

The ROC is generated by evaluating the TPR and FPR at all

� True Positive
Recall = _,_ thresholds of score returned by a classifier. For example, the
� True Positive + False Negative
� ROC is calculated for a probabilistic classifier by varying a

threshold from 0.0 to 1.0 and calculating FPR and TPR for

and measures the completeness of positively labeled instances

each value in the range. Area under the curve (AUC) is a

(i.e., the ratio of correct positively labeled instances to all

common single metric to compare ROC curves, and as the

instances that should have been labeled positive). F1 score name implies, is just the area under the ROC curve. An AUC
is the harmonic mean of Precision and Recall:

of 1 is perfect, and an AUC of 0.5 is the same as chance in a
binary classifier.

_F1 = 2 ·_ [Precision][ ·][ Recall]

Precision + Recall _[.]_ Averaging results over classes is done using both a micro

and macro average. Micro averaging takes into account the

ROC measures the trade-off of the true positive rate (TPR) number of elements in the test set. This means that smaller
to false positive rate (FPR) where classes will account for less in the average than larger classes.

Macro, on the other hand, averages over all classes regardless
of the number of elements in each individual class. For

� True Positive this paper, macro averaging is probably a better predictor
TPR = _,_
� True Positive + � False Negative of performance as the distributions of classes in our dataset

may not accurately represent the true distributions in the wild.

and However, both measures are provided for completeness.


-----

_B. Experimental Designs_

The proposed technique is evaluated using three different
experimental designs:

1) binary classification with random holdout test sets to
measure the general ability to detect DGA vs. nonDGA,
2) binary classification with holdout DGA algorithm
families to measure the ability to detect new DGAs,
and
3) multiclass classification to measure the ability to
distinguish one DGA algorithm from another.

The binary classification experimental design tests each
DGA classifier for it’s ability to make an accurate binary
decision: DGA or not DGA. The DGA class consists of
domains from all thirty families in our training set. This
experiment is run using n-fold cross validation with ten folds.
Evaluation is accomplished with both an ROC as well as a
detailed Precision, Recall and F1 score broken down by each
class. Both the micro and macro averages of Precision, Recall
and F1 score are also given.

In the second experiment, we test each classifier’s ability
to discover new DGA families not used in the training set.
The ten smallest DGA families are removed from the dataset
and each classifier is trained on all samples from the remaining
classes. Precision, Recall and F1 score is calculated on the test
set. In addition, we find both the micro and macro average of
these scores over all classes for each algorithm.

The multiclass classification design tests each DGA classifier for its ability to make an accurate decision on the family
of DGA. The random forest DGA classifier (using manual
features) uses a One vs. Rest while the LSTM and Bigram
classifiers do a direct multiclass classification. We display a
class breakdown of Precision, Recall and F1 score for each
class as well as the micro and macro average.

_C. Data_

This paper uses open datasets for reproducibility. A realworld system should use an expanded dataset to make it more
difficult for an adversary to reverse engineer and defeat the
classifier. The experimental designs use data from two sources.

1) The Alexa top 1 million domains [26] are used for
training domains that are not DGAs.
2) The OSINT DGA feed from Bambenek Consulting

[27] is used for DGA domains.

The OSINT DGA feed consists of thirty families of DGAs
with a varying number of examples from each class. This feed
contains approximately 750,000 DGA examples.

_D. Comparison to state of the art_

For each experiment, we compare the featureless LSTM
DGA classifier to

_•_ a featureless HMM model[3] defined in [2],

3HMM is excluded from the multiclass experiment due to poor performance.


_•_ logistic regression on character bigrams (simple features), and

_•_ a random forest DGA classifier using manually-crafted
domain features defined in [2], [3], [4], [18].

In particular, the manually crafted features of the random
forest DGA classifier include the following:

_•_ length of domain name,

_•_ entropy of character distribution in domain name,

_•_ vowel to consonant ratio,

_•_ Alexa 1M _n-gram_ frequency distribution cooccurrence count, where n = 3, 4 or 5,

_•_ _n-gram normality score, and_

_•_ _meaningful characters ratio._

Note that for the n-gram normality score, we use n = 3, n = 4
and n = 5 as three distinct features as opposed to n = 1, n = 2
and n = 3 as in [18] since the larger n-gram size performed
better in preliminary experiments. In addition, features were
trained in a random forest DGA classifier as opposed to a
Mahalanobis distance classifier as used in [18] as the random
forest DGA classifier produced better results.

Four separate HMMs are trained with one trained on the
non-DGA class, and three trained on the three largest DGA
classes in terms of support (Post, banjori, and ramnit).
The number of hidden states is set to the average length of the
domain names in the training set. We use the Neyman-Pearson
likelihood ratio test to classify a domain as DGA generated if

log Pi∗ _−_ log P0 ≥ _η,_

where

_i[∗]_ = arg max _Pi,_
_i∈{banjori, ramnit, Post}_

_P0 is the probability of being a non-DGA, and η is a user_
specified threshold. There are a few key differences from the
HMM presented in [2]. Authors in [2] use a distinct HMM
for each family of DGA, while we only create an HMM
for the three largest classes of DGAs in the training set. In
addition, we use the Neyman-Pearson likelihood ratio test as
opposed to a threshold directly on the maximum HMM score
from the DGA HMMs. Preliminary results showed a significant
improvement in ROC over the algorithm presented in [2] when
using these updates.

Even with the improved algorithm, the HMM performed
worse than other techniques evaluated in this paper. This is
especially true for the multiclass experiment. The original
HMM algorithm in [2] was presented on only four classes,
each with a significant support. This is unlike our setup that
has thirty classes with varying degrees of support. For this
reason we omit HMM results for the multiclass experiment.

We also compare our results with those of retrospective
techniques as reported in [5]. This comparison is only done
for the binary classification as our dataset only contains


-----

ROC - Binary Classification


TABLE III: Recall for all leave-out classes

**Domain Type** **HMM** **Features** **Bigram** **LSTM** **Support**


1.0

0.8


0.6

0.4


0.2 LSTM (AUC = 0.9993)

Bigrams (AUC = 0.9939)
Manual Features (AUC = 0.9798)
HMM (AUC = 0.8916)

0.0

10[-5] 10[-4] 10[-3] 10[-2] 10[-1] 10[0]


bedep 0.83 **0.99** **0.99** **0.99** 172

beebone 0.00 **1.00** 0.00 0.00 210

corebot 0.59 **1.00** 0.71 0.77 280

cryptowall **0.30** 0.20 0.18 0.20 94

dircrypt 0.94 0.91 0.94 **0.97** 510

fobber 0.93 0.93 0.95 **0.99** 600

hesperbot 0.90 0.76 0.86 **0.92** 192

matsnu 0.00 0.02 **0.04** 0.0 48

symmi 0.00 **1.00** 0.11 0.06 64
tempedreve 0.81 0.61 0.80 **0.84** 249

|0.83 0.00 0.59 0.30 0.94 0.93 0.90 0.00 0.00 0.81|0.99 1.00 1.00 0.20 0.91 0.93 0.76 0.02 1.00 0.61|0.99 0.00 0.71 0.18 0.94 0.95 0.86 0.04 0.11 0.80|0.99 0.00 0.77 0.20 0.97 0.99 0.92 0.0 0.06 0.84|
|---|---|---|---|

|Col1|Col2|Col3|Col4|Col5|Col6|
|---|---|---|---|---|---|
|||||||
|||||||
|||||||
|||LSTM|||)|
||||LSTM|(AUC = 0.9993||
|||Bigra Manu HMM||ms (AUC = 0.9 al Features (AU (AUC = 0.8916|939) C = 0.9798) )|


LSTM (AUC = 0.9993)
Bigrams (AUC = 0.9939)
Manual Features (AUC = 0.9798)
HMM (AUC = 0.8916)


False Positive Rate

Fig. 4: ROC curves for binary classification of DGA and
non-DGA generated domains using the LSTM model, logistic
regression with bigram features, random forest classifier with
manual features, and HMM classifier.


LSTM (AUC = 0.9993)
Bigrams (AUC = 0.9939)
Manual Features (AUC = 0.9798)


TABLE I: True Positive Rates of LSTM compared to Retrospective techniques

**Technique** **True Positive Rate** **False Positive Rate**

KL Divergence [3], [4] _< 0.5_ 0.05

NXDomains [5] 0.94 0.002
LSTM **0.98** **0.001**

domain names without any contextual information. In addition, retrospective techniques perform far worse than real-time
techniques for binary classification and, therefore, will likely
degrade even further for multiclass classification.

|Technique|True Positive Rate|False Positive Rate|
|---|---|---|
|KL Divergence [3], [4] NXDomains [5] LSTM|< 0.5 0.94 0.98|0.05 0.002 0.001|


V. RESULTS

Results for the three experiments and an interpretation of
model performance are presented in this section.


_A. Binary Classification_

The ROC curves for the HMM, random forest classifier
with manually-crafted features (Manual Features), logistic regression classifier on character bigrams (Bigrams), and LSTM
DGA clasifier (LSTM) are presented in Fig. 4. Note that the
abscissa (false positive rate) is on a log scale to highlight
the differences in the algorithms. LSTM provides the best
performance with an AUC of 0.9993 with the bigram model
at 0.9939. The difference between the two algorithms may
seem small, but are actually quite significant in a production
system. As an example, the LSTM model can classify 90%
of all DGAs with a 1 in 10,000 false positive rate. On the
other hand, a Bigram model will classify the same percentage
of DGA’s with a 1 in 550 false positive rate (i.e., the Bigram
model produces a false positive rate that is 20× that of the
LSTM model).


micro 0.78 **0.90** 0.80 0.81
macro 0.53 **0.74** 0.558 0.642

The breakdown of Precision, Recall, and F1 for each class
as classified by the binary classifiers is given in Table II.
The support (size of test set) is given in the last column. In
general, classes that are the most difficult to detect have smaller
support. This is expected as they have a smaller contribution to
model updates during training than larger classes. In addition
matsnu was undetectable by all algorithms. matsnu is a
dictionary-based DGA, meaning it is created by randomly
selecting and concatenating multiple words from a dictionary.
Interestingly, suppobox is also a dictionary based DGA, but
was detectable (to some extent) by the LSTM. The size of the
suppobox training was about twenty times that of matsnu
allowing for repeats of randomly selected dictionary words.
These repeats allow the LSTM to learn the dictionaries of such
DGAs. We leave an in-depth analysis of dictionary based DGA
to future work.

The HMM performed worse than expected. The results presented in [2] only used a small number of homogenous DGA
families (Conficker, Murofet, Bobax and, Sinowal)
while the experiments in this paper use over 30 different
families. Some of these families in this paper are related, but
overall, our results were generated from a larger/more rich
dataset. As discussed later in this paper, the letter distributions
are very different across the 30 DGA families used in this paper. For example, DGA families such as Cryptolocker and
ramnit have near uniform distributions over letters, dyre
has a uniform distribution over hexadecimal characters with a
dictionary word as a prefix, and suppobox and matsnu use
English words to create domains giving a distribution very
similar to english based domains. In contrast, Conficker

[28], Murofet [29], Bobax [30] and Sinowal [31] all use
a generator that gives a uniform distribution over letters similar
to Cryptolocker and ramnit.

Table I displays the true positive rate and false positive
rate for retrospective techniques as compared to the LSTM
technique presented by this paper. As can be seen, the LSTM
technique significantly outperforms the best retrospective techniques.


_B. Leave-Class-Out Binary Classification_

The binary leave-one-out classifier is interesting as it tests
each algorithm’s robustness to DGA families not seen during


-----

TABLE II: Precision, Recall and F1 Score for Binary Classifiers

|Domain Type|Precision HMM Features Bigram LSTM|Col3|Col4|Col5|Recall HMM Features Bigram LSTM|Col7|Col8|Col9|F1 Score HMM Features Bigram LSTM|Col11|Col12|Col13|Support|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|Alexa Cryptolocker P2P Gameover Zeus Post Tovar GOZ Volatile Cedar / Explosive banjori bedep beebone corebot cryptowall dircrypt dyre fobber geodo hesperbot matsnu murofet necurs nymaim pushdo pykspa qakbot ramnit ranbyus shifu shiotob/urlzone/bebloh simda suppobox symmi tempedreve tinba|0.8300 1.0000 1.0000 1.0000 0.0000 1.0000 1.0000 0.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.0000 1.0000 1.0000|0.9400 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.0000 1.0000 1.0000 1.0000|0.9700 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000|0.9900 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000|1.0000 0.9000 0.9900 1.0000 0.0000 0.5900 0.8100 0.0000 0.5900 0.1100 0.9100 1.0000 0.8900 0.9100 0.8300 0.0000 0.9200 0.8800 0.8000 0.6600 0.7200 0.9100 0.8800 0.9000 0.7200 0.9000 0.5600 0.0100 0.0000 0.7600 0.8900|1.0000 0.9800 1.0000 1.0000 0.4600 0.9400 1.0000 1.0000 1.0000 0.0600 0.9200 1.0000 0.9600 1.0000 0.7700 0.0000 1.0000 0.8400 0.5600 0.4700 0.5400 0.9600 0.9100 1.0000 0.2100 0.9700 0.0800 0.0000 1.0000 0.5700 0.9800|1.0000 0.9700 1.0000 1.0000 0.4900 1.0000 1.0000 0.9700 1.0000 0.1400 0.9600 0.9900 0.9700 0.9900 0.8500 0.0000 0.9900 0.9400 0.7300 0.5600 0.7700 0.9600 0.9400 0.9800 0.6600 0.9500 0.4000 0.0000 0.7900 0.8500 0.9700|1.0000 0.9900 1.0000 1.0000 0.9900 1.0000 1.0000 1.0000 0.9600 0.1200 0.9600 1.0000 0.9700 0.9900 0.9700 0.0000 1.0000 0.9600 0.8000 0.6000 0.9000 0.9800 0.9600 0.9800 0.7700 0.9800 0.9200 0.3200 0.6900 0.7700 0.9900|0.9100 0.9500 0.9900 1.0000 0.0000 0.7400 0.8900 0.0000 0.7400 0.1900 0.9500 1.0000 0.9400 0.9500 0.9100 0.0000 0.9600 0.9400 0.8900 0.7900 0.8400 0.9500 0.9400 0.9500 0.8400 0.9500 0.7100 0.0200 0.0000 0.8600 0.9400|0.9700 0.9900 1.0000 1.0000 0.6300 0.9700 1.0000 1.0000 1.0000 0.1100 0.9600 1.0000 0.9800 1.0000 0.8700 0.0000 1.0000 0.9100 0.7200 0.6400 0.7000 0.9800 0.9500 1.0000 0.3500 0.9900 0.1400 0.0000 1.0000 0.7300 0.9900|0.9900 0.9900 1.0000 1.0000 0.6600 1.0000 1.0000 0.9900 1.0000 0.2500 0.9800 0.9900 0.9800 1.0000 0.9200 0.0000 1.0000 0.9700 0.8500 0.7200 0.8700 0.9800 0.9700 0.9900 0.8000 0.9700 0.5800 0.0100 0.8800 0.9200 0.9900|0.9900 0.9900 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9800 0.2100 0.9800 1.0000 0.9900 1.0000 0.9800 0.0000 1.0000 0.9800 0.8900 0.7500 0.9500 0.9900 0.9800 0.9900 0.8700 0.9900 0.9600 0.4800 0.8200 0.8700 0.9900|300064 1799 298 19863 294 121678 53 65 81 29 150 2389 181 173 58 14 4292 1232 1815 507 4250 1517 27439 2625 697 3031 4449 298 18 74 18505|
|Micro Average Macro Average|0.9008 0.8655|0.9647 0.9335|0.9826 0.9668|0.9942 0.9674|0.8815 0.6787|0.9639 0.7477|0.9848 0.8006|0.9937 0.8571|0.8739 0.7335|0.9593 0.7929|0.9851 0.8468|0.9906 0.8913|16708 16708|


training. Only Recall is presented for this experiment as there
are no non-DGA generated domains in this test set. The results
for this experiment are shown in Table III.

The manual features random forest classifier performs
best in terms of both micro and macro average. On the
other hand, the LSTM classifier has the most families that
it performs best on (five in total as opposed to four in total
for the manual features classifier). The biggest discrepancy
between manual features and LSTM was with beebone. In
particular, the manual features classifier identifies all of the
beebone samples, while the LSTM model recovers none.
The domain names from beebone have a rigid structure, like
ns1.backdates13.biz andns1.backdates0.biz, so
that the LSTM model was unable to learn the structure that included the word backdates without training data. The results are nearly as dramatic for symmi,
which produces nearly-pronounceable domain names like
hakueshoubar.ddns.net, by drawing a random vowel
or a random consonant at each even-numbered index, then
drawing a random character of the opposite class (vowel/consonant) in the subsequent index location. These examples
highlight blind spots in the LSTM classifier. However, these
blind spots can be easily fixed through training with the use
of an adversarial network (i.e., train a generator network that
creates domains that confuses our classifier).

Apparently, the structure of some DGA families–even if
not elaborately designed–are peculiar enough to necessitate
their inclusion in the training set. As evident in the results
for Experiment 1 in Table II, the LSTM readily detects
these families with distinct structure when accounted for in
the training set with sufficient support. The manual features


appear to be generic enough to detect these families with high
recall. However, its important to note that manual features
were designed specifically for known DGA families and all
of our DGAs in our test set are known (i.e., our dataset is
known and labeled) making this experiment biased to a feature
based classifier. Even with this bias, the LSTM classifier still
performs best in terms of the number of DGA families it
detects.

_C. Multiclass_

The HMM results were omitted from the multiclass experiments due to poor performance. As stated previously, the
HMM algorithm was designed for few DGAs, whereas our
experiments include over 30 classes. Precision, Recall, and
_F1 is displayed in Table IV for the random forest classifier_
with manual features (Manual Features), multinomial logistic
regression on character bigrams (Bigram) and the LSTM
classifier. The LSTM classifier significantly outperforms the
other two algorithms in both the micro and macro averaged
Precision, Recall, and F1 score. In general, poor performance
resulted from classes with small representation. One exception
was Cryptolocker, which no multiclass classifier was able
to detect. However, all the binary classifiers were able to
distinguish Cryptolocker from other families.

Fig. 5 shows the confusion matrix for the LSTM multiclass classifier. A large number of the incorrectly classified Cryptolocker DGAs are classified as ramnit. To
further investigate, the unigram distributions for four DGA
families and Alexa are shown in Fig. 6. The distributions
for Cryptolocker and ramnit are both uniform over
the same range. This is expected as they are both generated


-----

TABLE IV: Precision, Recall and F1 Score for Multiclass Classifiers

**Precision** **Recall** **_F1 Score_**

Domain Type Features Bigram LSTM Features Bigram LSTM Features Bigram LSTM Support

Alexa 0.914 0.980 **0.990** 0.960 0.990 **1.000** 0.940 0.988 **0.990** 199978
Cryptolocker 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1189
P2P Gameover Zeus 0.000 **0.343** 0.327 0.000 **0.288** 0.217 0.000 **0.308** 0.247 196
Post Tovar GOZ 0.941 **1.000** **1.000** **1.000** **1.000** **1.000** 0.970 **1.000** **1.000** 13185
Volatile Cedar / Explosive 0.000 **1.000** 0.987 0.000 **1.000** 0.980 0.000 **1.000** 0.980 200
banjori 0.900 0.990 **1.000** 0.938 **1.000** **1.000** 0.920 **1.000** **1.000** 81281
bedep 0.000 0.000 **0.943** 0.000 0.000 **0.107** 0.000 0.000 **0.187** 34
beebone **1.000** **1.000** **1.000** 0.560 **1.000** **1.000** 0.713 **1.000** **1.000** 42
corebot 0.000 **1.000** **1.000** 0.000 0.980 **0.990** 0.000 0.990 **0.993** 54
cryptowall 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 15
dircrypt 0.000 **0.083** 0.000 0.000 **0.010** 0.000 0.000 **0.020** 0.000 100
dyre 0.985 0.988 **1.000** **1.000** 0.988 **1.000** 0.991 0.988 **1.000** 1600
fobber 0.000 0.000 **0.177** 0.000 0.000 **0.023** 0.000 0.000 **0.040** 121
geodo 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 114
hesperbot 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 36
matsnu 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 9
murofet **0.883** 0.643 0.783 0.066 0.542 **0.700** 0.122 0.590 **0.737** 2845
necurs 0.000 0.000 **0.643** 0.000 0.000 **0.093** 0.000 0.000 **0.160** 827
nymaim 0.000 0.390 **0.477** 0.000 0.113 **0.190** 0.000 0.175 **0.267** 1222
pushdo 0.000 0.770 **0.853** 0.000 0.588 **0.640** 0.000 0.665 **0.730** 339
pykspa 0.000 0.788 **0.910** 0.000 0.593 **0.713** 0.000 0.675 **0.800** 2827
qakbot 0.000 **0.590** 0.590 0.000 0.232 **0.387** 0.000 0.338 **0.463** 993
ramnit 0.566 0.637 **0.770** 0.654 0.763 **0.850** 0.605 0.690 **0.810** 18308
ranbyus 0.439 0.000 **0.450** 0.000 0.000 **0.517** 0.001 0.000 **0.460** 1736
shifu 0.000 0.037 **0.560** 0.000 0.003 **0.570** 0.000 0.007 **0.553** 465
shiotob/urlzone/bebloh 0.000 0.965 **0.973** 0.000 0.853 **0.907** 0.000 0.907 **0.940** 2016
simda 0.000 0.840 **0.930** 0.000 0.750 **0.977** 0.000 0.792 **0.950** 2955
suppobox 0.000 0.392 **0.833** 0.000 0.062 **0.517** 0.000 0.112 **0.627** 197
symmi 0.000 0.625 **0.913** 0.000 0.117 **0.857** 0.000 0.200 **0.883** 11
tempedreve 0.000 **0.043** 0.000 0.000 **0.010** 0.000 0.000 **0.018** 0.000 50
tinba 0.821 0.735 **0.910** 0.923 0.802 **0.990** 0.869 0.767 **0.950** 12332

Micro Average 0.851 0.933 **0.963** 0.888 0.944 **0.970** 0.867 0.940 **0.963** 11138
Macro Average 0.240 0.479 **0.614** 0.197 0.409 **0.523** 0.198 0.427 **0.541** 11138

|0.914 0.000 0.000 0.941 0.000 0.900 0.000 1.000 0.000 0.000 0.000 0.985 0.000 0.000 0.000 0.000 0.883 0.000 0.000 0.000 0.000 0.000 0.566 0.439 0.000 0.000 0.000 0.000 0.000 0.000 0.821|0.980 0.000 0.343 1.000 1.000 0.990 0.000 1.000 1.000 0.000 0.083 0.988 0.000 0.000 0.000 0.000 0.643 0.000 0.390 0.770 0.788 0.590 0.637 0.000 0.037 0.965 0.840 0.392 0.625 0.043 0.735|0.990 0.000 0.327 1.000 0.987 1.000 0.943 1.000 1.000 0.000 0.000 1.000 0.177 0.000 0.000 0.000 0.783 0.643 0.477 0.853 0.910 0.590 0.770 0.450 0.560 0.973 0.930 0.833 0.913 0.000 0.910|0.960 0.000 0.000 1.000 0.000 0.938 0.000 0.560 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.066 0.000 0.000 0.000 0.000 0.000 0.654 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.923|0.990 0.000 0.288 1.000 1.000 1.000 0.000 1.000 0.980 0.000 0.010 0.988 0.000 0.000 0.000 0.000 0.542 0.000 0.113 0.588 0.593 0.232 0.763 0.000 0.003 0.853 0.750 0.062 0.117 0.010 0.802|1.000 0.000 0.217 1.000 0.980 1.000 0.107 1.000 0.990 0.000 0.000 1.000 0.023 0.000 0.000 0.000 0.700 0.093 0.190 0.640 0.713 0.387 0.850 0.517 0.570 0.907 0.977 0.517 0.857 0.000 0.990|0.940 0.000 0.000 0.970 0.000 0.920 0.000 0.713 0.000 0.000 0.000 0.991 0.000 0.000 0.000 0.000 0.122 0.000 0.000 0.000 0.000 0.000 0.605 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.869|0.988 0.000 0.308 1.000 1.000 1.000 0.000 1.000 0.990 0.000 0.020 0.988 0.000 0.000 0.000 0.000 0.590 0.000 0.175 0.665 0.675 0.338 0.690 0.000 0.007 0.907 0.792 0.112 0.200 0.018 0.767|0.990 0.000 0.247 1.000 0.980 1.000 0.187 1.000 0.993 0.000 0.000 1.000 0.040 0.000 0.000 0.000 0.737 0.160 0.267 0.730 0.800 0.463 0.810 0.460 0.553 0.940 0.950 0.627 0.883 0.000 0.950|
|---|---|---|---|---|---|---|---|---|


using a series of multiplies, divisions and modulos based on
a single seed [13], [10]. On the other hand, suppobox is
interesting as it generates unigrams similar to distributions
seen by the Alexa top one million domains and is often
confused with the benign set. As discussed earlier, suppobox
is an English dictionary-based DGA, meaning domains are
constructed by concatenating multiple, randomly chosen words
from the English dictionary. Interestingly, only the LSTM
classifier was able to consistently detect suppobox (as seen
in Table II). This shows LSTM’s ability to extract some deep
understanding that is lost by other classifiers. Specifically, the
LSTM actually learns the dictionary used by suppobox to
construct domains.

Fig. 7 shows the all-to-all cosine distance of the unigram
distribution between all DGA families and the Alexa top one
million domains. dyre stands out as it is extremely dissimilar
to other algorithms. This is not surprising when comparing
this figure to Table 6. dyre has a nearly uniform distribution
over primarily hexadecimal numbers (non-hexadecimal letters
exist, but are rare).

When comparing both Fig. 5, Fig. 7, and Table II, some
correlation can be seen between the unigram distribution and
DGA algorithms that are often misclassified. This suggests that
it’s not only the lack of representation of these algorithms
in the training set, but also the distribution of letters that is
causing much of the misclassification. More specifically, many


DGAs produce domains that look nearly identical in terms
of their character distributions making multiclass classification
difficult if not impossible. To test this, we performed agglomerative clustering on each DGA’s family unigram distribution
using cosine distance. We set a threshold of 0.2 to define super
families (the threshold was chosen using domain knowledge
of DGA families). These super families are shown in Table V.
Interesting super families include Super Family 4 (dictionarybased DGAs), Super Family 5 (randomly selected character
DGAs), and Super Family 7 (randomly selected characters with
near equal vowels and consonants).

The same multiclass classification experiment was run on
these super families and the results are shown in VI. As
expected, all three classifiers performed much better on super
families. Results demonstrate that an actual deployment of
a multiclass DGA classification would be best run on super
families, often alerting on groups of DGAs instead of alerting
on a single family. Again, the LSTM classifier performs
significantly better than other algorithms.

_D. Model Interpretability_

We analyze the binary LSTM classifier in order to provide
some intuition about the function of the various layers. It
is important to note that in the LSTM model, each layer in
Fig. 1 is jointly optimized for the binary classification task.


-----

TABLE VI: Precision, Recall and F1 Score for Multiclass Classifiers

**Precision** **Recall** **_F1 Score_**

Domain Type Features Bigram LSTM Features Bigram LSTM Features Bigram LSTM Support

Alexa 0.930 0.980 **0.990** 0.960 0.990 **1.000** 0.940 **0.990** **0.990** 199906
Super Family 0 0.980 0.990 **1.000** **1.000** 0.990 **1.000** 0.990 0.990 **1.000** 1603
Super Family 1 **1.000** **1.000** **1.000** 0.590 **1.000** **1.000** 0.740 **1.000** **1.000** 43
Super Family 2 0.000 **1.000** **1.000** 0.000 **1.000** 0.970 0.000 **1.000** 0.990 203
Super Family 3 0.000 0.950 **0.980** 0.000 0.810 **0.900** 0.000 0.870 **0.940** 1998
Super Family 4 0.910 0.990 **1.000** 0.920 **1.000** **1.000** 0.910 0.990 **1.000** 81559
Super Family 5 0.870 0.950 **0.970** 0.880 0.940 **0.970** 0.870 0.950 **0.970** 40450
Super Family 6 0.000 0.840 **0.960** 0.000 0.550 **0.670** 0.000 0.670 **0.790** 2877
Super Family 7 0.000 0.830 **0.940** 0.000 0.680 **0.910** 0.000 0.750 **0.920** 3326
Super Family 8 0.940 0.990 **1.000** **1.000** 0.990 **1.000** 0.970 0.990 **1.000** 13267
Super Family 9 0.000 0.980 **1.000** 0.000 0.910 **1.000** 0.000 0.940 **1.000** 52
Super Family 10 0.000 0.000 **0.910** 0.000 0.000 **0.830** 0.000 0.000 **0.870** 11

Micro Average 0.896 0.977 **0.990** 0.919 0.979 **0.992** 0.903 0.980 **0.988** 28774
Macro Average 0.469 0.875 **0.979** 0.446 0.822 **0.938** 0.452 0.845 **0.956** 28774

0.2 Cryptolocker

0.1

0.0
                                                                                                - . 0 1 2 3 4 5 6 7 8 9 _ a b c d e f g h i j k l m n o p q r s t u v w x y z

0.2 ramnit

0.1

0.0
                                                                                                - . 0 1 2 3 4 5 6 7 8 9 _ a b c d e f g h i j k l m n o p q r s t u v w x y z

0.2 dyre

0.1

0.0
                                                                                                - . 0 1 2 3 4 5 6 7 8 9 _ a b c d e f g h i j k l m n o p q r s t u v w x y z

0.2 suppobox

0.1

0.0
                                                                                                - . 0 1 2 3 4 5 6 7 8 9 _ a b c d e f g h i j k l m n o p q r s t u v w x y z

0.2 Alexa Top 1M

0.1

0.0
                                                                                                - . 0 1 2 3 4 5 6 7 8 9 _ a b c d e f g h i j k l m n o p q r s t u v w x y z

Fig. 6: Unigram distributions for Cryptolocker, ramnit,
dyre, suppobox and the Alexa top one million.

|0.930 0 0.980 1 1.000 2 0.000 3 0.000 4 0.910 5 0.870 6 0.000 7 0.000 8 0.940 9 0.000 10 0.000|0.980 0.990 1.000 1.000 0.950 0.990 0.950 0.840 0.830 0.990 0.980 0.000|0.990 1.000 1.000 1.000 0.980 1.000 0.970 0.960 0.940 1.000 1.000 0.910|Col4|0.960 1.000 0.590 0.000 0.000 0.920 0.880 0.000 0.000 1.000 0.000 0.000|0.990 1.00 0.990 1.00 1.000 1.00 1.000 0.97 0.810 0.90 1.000 1.00 0.940 0.97 0.550 0.67 0.680 0.91 0.990 1.00 0.910 1.00 0.000 0.83|0 0.940 0.990 0 0.990 0.990 0 0.740 1.000 0 0.000 1.000 0 0.000 0.870 0 0.910 0.990 0 0.870 0.950 0 0.000 0.670 0 0.000 0.750 0 0.970 0.990 0 0.000 0.940 0 0.000 0.000|
|---|---|---|---|---|---|---|


Fig. 5: Confusion matrix for the LSTM multiclass model.
Blocks represent the fraction of DGA families on the vertical
axis classified as DGA families on the horizontal axis, where 0
is depicted as white and 1 depicted as black. A perfect classifier
would produce an identity matrix composed of black blocks.

Nevertheless, analyzing each layer independently does provide
some intuition about the model’s operation and performance.

The embedding layer in Fig. 1 learns a 128-dimensional
vector representation for each character in the set of valid
domain characters. A two-dimensional linear projection (via
PCA) of the character embeddings is shown in Fig. 8. It is clear
that the learned embedding consists of non-orthogonal vectors
for each character. This is in contrast to the orthonormal
one-hot encoding of bigrams used in the logistic regression
character bigram model. The placement of vectors in the


TABLE V: DGA Super Families


-----

Fig. 7: All-to-all cosine distance comparison of unigram distributions of all DGA familes and the Alexa top one million.
Distances range from 0 to 1 with 0 depicted as white and 1
depicted as black.

embedding space (and subsequently, the two-dimensional plot)
relates to the similarity or interchangeability of characters for
the DGA vs. non-DGA discrimination task. For example, one
would infer from the plot that replacing “9” with “5” would
have much less effect on the score of the DGA classifier than
would replacing “9” with “w”. The plot shows that there are
obvious clusters of numeric digits and alphabetic characters
(and underscore), while the less-common hyphen and period
are fairly dissimilar to every other character.

Next, we investigate the state (or memory) of several
LSTM cells in the second layer of the LSTM model in Fig. 1.
The state of an LSTM cell has an initial value that is updated
as each character of a domain is fed through the model. It is
a function of the current input (embedded character vector)
and the previous emission of the LSTM cell. In turn, the
LSTM’s emission is a function of the current state, current
input, and previous emission. In our model, the final emission
(corresponding to the last character in the domain) from each
of 128 LSTM cells is fed to the final logistic regression layer
of the model to produce the DGA score.

Each LSTM cell acts somewhat as an optimized feature
extractor on the sequences of embedded character vectors
produced from the previous embedding layer, and the cell’s
state provides an indication of what the cell is tracking.
Similar to [32], Fig. 9 shows the tanh of a particular LSTM
cell’s state (called memory in [32]) as it is updated characterby-character during a prediction task. As shown in Fig. 9,
some states in our model have a tendency to track common
characteristics of domain names in the dataset. For example,
Fig. 9(a) shows a state that seems to trend with domain name
length, with soft resets on periods and hyphens. The LSTM
cell state depicted in Fig. 9(b) appears to accumulate large


Fig. 8: Two-dimensional linear projection (PCA) of the embedded character vectors learned by the LSTM binary classifier.
Note that the model groups characters by similar effect on the
LSTM layer’s states and the subsequent model loss.

values for long sequences of random alphanumeric characters.
The state in Fig. 9(c) seems to accumulate value on sequences
of hexadecimal characters, as is the predominant pattern in
dyre. Finally, Fig. 9(d) depicts the most common scenario we
encountered while inspecting states: it’s generally very difficult
to determine precisely what the state is tracking. We note that
our application of LSTMs for DGA classification does not
yield quite as clearly the distinctive purpose of states as has
been demonstrated for natural language models [32].

VI. CONCLUSION

This paper presented an approach using LSTM networks
to classify DGA generated domains. LSTMs are advantageous
over other techniques as they are featureless, using raw domain
names as its input. There is no need to manually create features
that are difficult to maintain and can be rendered useless
in an adversarial machine learning setting. In addition, an
LSTM classifier can be run in real-time on single domains
on standard commodity hardware making it trivial to deploy
in virtually all security settings. Experiments on publiclyavailable datasets showed that the LSTM classifier performed
significantly better than other techniques (both real-time and
retrospective), with the ability to classify 90% of DGAs with
a false positive rate of 10[−][4]. In addition, the LSTM classifier
may be trivially modified for multiclass classification, which
can provide context about the origin and intent of the domaingenerating malware.

An in-depth analysis of results showed that the most
difficult algorithms to classify are, intuitively, those that are
modeled from a similar character distribution as domains in
the Alexa top one million. Some of these DGA families
concatenate randomly selected words from (typically) English
dictionaries. However, the LSTM classifier was able to distinguish those DGA families when the amount of training
examples were significant and the families were grouped
together in super families.


-----

3 6 0 __________________
n a x o s l t s v a c w s s q f v b ramnit

c g c p c u s m w p r k w v p l l w q __________________
c b m 2 u t w j g b g v 5 p o p c 8 q . d d n s corebot

5 0 w l a p c p i 4 5 2 g l y . d d n s __________________
t e a 5 2 5 1 9 6 2 0 3 e e 5 8 e 0 c f b 3 f 4 8 6 f c 8 5 5 7 b 4 dyre

a 9 0 5 7 5 6 f 1 2 a f 9 9 2 d 6 6 7 9 c 6 7 6 5 d 1 9 e a a 8 b f __________________

t o k u u k c c j b b n Cryptolocker

c w e e m f j x u x s x u __________________
h o s t s u r p r i s e r e n t matsnu

state 83 t i m e - h o p e - g r o c e r y
(a) approximately tracks long domain names, with a soft reset on period and hypen

g o o g l e Alexa
y o u t u b e
f a c e b o o k

b a i d u
y a h o o
w i k i p e d i a

a m a z o n

q q
t w i t t e r

l i v e
h a o 1 2 3

3 6 0 __________________
n a x o s l t s v a c w s s q f v b ramnit

c g c p c u s m w p r k w v p l l w q __________________
c b m 2 u t w j g b g v 5 p o p c 8 q . d d n s corebot

5 0 w l a p c p i 4 5 2 g l y . d d n s __________________
t e a 5 2 5 1 9 6 2 0 3 e e 5 8 e 0 c f b 3 f 4 8 6 f c 8 5 5 7 b 4 dyre

a 9 0 5 7 5 6 f 1 2 a f 9 9 2 d 6 6 7 9 c 6 7 6 5 d 1 9 e a a 8 b f __________________

t o k u u k c c j b b n Cryptolocker

c w e e m f j x u x s x u __________________
h o s t s u r p r i s e r e n t matsnu

t i m e - h o p e - g r o c e r y
(b) appears to track random alphanumeric sequences, as in ramnit, corebot, dyre and Cryptolocker

g o o g l e Alexa
y o u t u b e
f a c e b o o k

b a i d u
y a h o o
w i k i p e d i a

a m a z o n

q q
t w i t t e r

l i v e
h a o 1 2 3

3 6 0 __________________
n a x o s l t s v a c w s s q f v b ramnit

c g c p c u s m w p r k w v p l l w q __________________
c b m 2 u t w j g b g v 5 p o p c 8 q . d d n s corebot

5 0 w l a p c p i 4 5 2 g l y . d d n s __________________
t e a 5 2 5 1 9 6 2 0 3 e e 5 8 e 0 c f b 3 f 4 8 6 f c 8 5 5 7 b 4 dyre

a 9 0 5 7 5 6 f 1 2 a f 9 9 2 d 6 6 7 9 c 6 7 6 5 d 1 9 e a a 8 b f __________________

t o k u u k c c j b b n Cryptolocker

c w e e m f j x u x s x u __________________
h o s t s u r p r i s e r e n t matsnu

t i m e - h o p e - g r o c e r y
(c) appears to track hexademical sequences, as in dyre

g o o g l e Alexa
y o u t u b e
f a c e b o o k

b a i d u
y a h o o
w i k i p e d i a

a m a z o n

q q
t w i t t e r

l i v e
h a o 1 2 3

3 6 0 __________________
n a x o s l t s v a c w s s q f v b ramnit

c g c p c u s m w p r k w v p l l w q __________________
c b m 2 u t w j g b g v 5 p o p c 8 q . d d n s corebot

5 0 w l a p c p i 4 5 2 g l y . d d n s __________________
t e a 5 2 5 1 9 6 2 0 3 e e 5 8 e 0 c f b 3 f 4 8 6 f c 8 5 5 7 b 4 dyre

a 9 0 5 7 5 6 f 1 2 a f 9 9 2 d 6 6 7 9 c 6 7 6 5 d 1 9 e a a 8 b f __________________

t o k u u k c c j b b n Cryptolocker

c w e e m f j x u x s x u __________________
h o s t s u r p r i s e r e n t matsnu

t i m e - h o p e - g r o c e r y
(d) as in this example, it is difficult to ascribe an intuitive function of most states

Fig. 9: Examples of LSTM cell state values as domain characters are fed into the model. Color corresponds to the tanh of the
state, and does not necessarily denote DGA or non-DGA. Color preceeding a domain name denotes the cell’s initial state. Our
model correctly identifies DGA or non-DGA for all examples shown except for the final two matsnu examples.


g o o g l e Alexa
y o u t u b e
f a c e b o o k

b a i d u
y a h o o
w i k i p e d i a

a m a z o n

q q
t w i t t e r

l i v e
h a o 1 2 3

3 6 0 __________________
n a x o s l t s v a c w s s q f v b ramnit

c g c p c u s m w p r k w v p l l w q __________________
c b m 2 u t w j g b g v 5 p o p c 8 q . d d n s corebot

5 0 w l a p c p i 4 5 2 g l y . d d n s __________________
t e a 5 2 5 1 9 6 2 0 3 e e 5 8 e 0 c f b 3 f 4 8 6 f c 8 5 5 7 b 4 dyre

a 9 0 5 7 5 6 f 1 2 a f 9 9 2 d 6 6 7 9 c 6 7 6 5 d 1 9 e a a 8 b f __________________

t o k u u k c c j b b n Cryptolocker

c w e e m f j x u x s x u __________________
h o s t s u r p r i s e r e n t matsnu

state 83 t i m e - h o p e - g r o c e r y


g o o g l e Alexa
y o u t u b e
f a c e b o o k

b a i d u
y a h o o
w i k i p e d i a

a m a z o n

q q
t w i t t e r

l i v e
h a o 1 2 3

3 6 0 __________________
n a x o s l t s v a c w s s q f v b ramnit

c g c p c u s m w p r k w v p l l w q __________________
c b m 2 u t w j g b g v 5 p o p c 8 q . d d n s corebot

5 0 w l a p c p i 4 5 2 g l y . d d n s __________________
t e a 5 2 5 1 9 6 2 0 3 e e 5 8 e 0 c f b 3 f 4 8 6 f c 8 5 5 7 b 4 dyre

a 9 0 5 7 5 6 f 1 2 a f 9 9 2 d 6 6 7 9 c 6 7 6 5 d 1 9 e a a 8 b f __________________

t o k u u k c c j b b n Cryptolocker

c w e e m f j x u x s x u __________________
h o s t s u r p r i s e r e n t matsnu

t i m e - h o p e - g r o c e r y


-----

We also provided an in-depth analysis of the functional
interpretability of each layer in the LSTM DGA classifier. Our
analysis revealed that the model optimized vector embeddings
for each character in a somewhat intuitive way, with distinct
clusters for alphabetic and numeric digits. Our analysis of the
LSTM layer revealed the existence of LSTM cells that track
a few somewhat interpretable features such as a hexadecimal
and random character sequences. However, we found that most
states did not provide clear interpretable evidence of function,
in contrast to other applications of LSTMs, e.g., [32].

Like all models, experiments show that our model is
sensitive to class imbalance, which limits its ability to detect
families with very little support in the training set (e.g.,
matsnu, symmi and cryptowall). In the extreme case
of zero training support, it was found that the LSTM model
does not generalize well for detecting all families with very
distinctive structure. Manually-engineered features were able
to detect some of those families that an LSTM classifier
missed, and we hypothesize that this is directly a result of
expert-tuned bias in the feature set that cannot be represented
in the featureless LSTM model.

All relevant source code and suggestions on deploying a
real-world LSTM DGA classifier were provided by this paper.
In addition, we reference open datasets to create an equal
classifier to that presented in this paper. To the best of our
knowledge, the presented system is by far the best performing
DGA classification system as well as one of the easiest to
deploy.

REFERENCES

[1] M. K¨uhrer, C. Rossow, and T. Holz, “Paint it black: Evaluating the
effectiveness of malware blacklists,” in Research in Attacks, Intrusions
_and Defenses, pp. 1–21, Springer, 2014._

[2] M. Antonakakis, R. Perdisci, Y. Nadji, N. Vasiloglou, S. Abu-Nimeh,
W. Lee, and D. Dagon, “From throw-away traffic to bots: detecting the
rise of DGA-based malware,” in P21st USENIX Security Symposium
_(USENIX Security 12), pp. 491–506, 2012._

[3] S. Yadav, A. K. K. Reddy, A. Reddy, and S. Ranjan, “Detecting
algorithmically generated malicious domain names,” in Proc. 10th ACM
_SIGCOMM conference on Internet measurement, pp. 48–61, ACM,_
2010.

[4] S. Yadav, A. K. K. Reddy, A. N. Reddy, and S. Ranjan, “Detecting algorithmically generated domain-flux attacks with DNS traffic analysis,”
_Networking, IEEE/ACM Transactions on, vol. 20, no. 5, pp. 1663–1677,_
2012.

[5] S. Krishnan, T. Taylor, F. Monrose, and J. McHugh, “Crossing the
threshold: Detecting network malfeasance via sequential hypothesis
testing,” in 2013 43rd Annual IEEE/IFIP International Conference on
_Dependable Systems and Networks (DSN), pp. 1–12, IEEE, 2013._

[6] [F. Chollet, “keras.” https://github.com/fchollet/keras, 2016.](https://github.com/fchollet/keras)

[7] M. Knysz, X. Hu, and K. G. Shin, “Good guys vs. bot guise: Mimicry
attacks against fast-flux detection systems,” in INFOCOM, 2011 Pro_ceedings IEEE, pp. 1844–1852, IEEE, 2011._

[8] B. Stone-Gross, M. Cova, B. Gilbert, R. Kemmerer, C. Kruegel, and
G. Vigna, “Analysis of a botnet takeover,” Security & Privacy, IEEE,
vol. 9, no. 1, pp. 64–72, 2011.

[9] M. Ward, “Cryptolocker victims to get files back for free,” BBC News,
_August, vol. 6, 2014._

[10] [“A closer look at cyrptolocker’s DGA.” https://blog.fortinet.com/post/](https://blog.fortinet.com/post/a-closer-look-at-cryptolocker-s-dga)
[a-closer-look-at-cryptolocker-s-dga. Accessed: 2016-04-22.](https://blog.fortinet.com/post/a-closer-look-at-cryptolocker-s-dga)

[11] N. Hampton and Z. A. Baig, “Ransomware: Emergence of the cyberextortion menace,” in Australian Information Security Management
_Conference, 2015._


[12] A. Cherepanov and R. Lipovsky, “Hesperbot-A new, advanced banking
trojan in the wild,” 2013.

[13] Symantec, W32.Ramnit analysis. 2015-02-24, Version 1.0.

[14] J. Geffner, “End-to-end analysis of a domain generating algorithm
malware family.” Black Hat USA 2013, 2013.

[15] M. Antonakakis, R. Perdisci, D. Dagon, W. Lee, and N. Feamster,
“Building a dynamic reputation system for DNS.,” in USENIX security
_symposium, pp. 273–290, 2010._

[16] L. Bilge, E. Kirda, C. Kruegel, and M. Balduzzi, “Exposure: Finding
malicious domains using passive analaysis.,” in 18th Annual Network
_and Distributed System Security Symposium, 2011._

[17] L. Bilge, S. Sen, D. Balzarotti, E. Kirda, and C. Kruegel, “Exposure: a
passive DNS analysis service to detect and report malicious domains,”
_ACM Transactions on Information and System Security (TISSEC),_
vol. 16, no. 4, p. 14, 2014.

[18] S. Schiavoni, F. Maggi, L. Cavallaro, and S. Zanero, “Phoenix: DGAbased botnet tracking and intelligence,” in Detection of intrusions and
_malware, and vulnerability assessment, pp. 192–211, Springer, 2014._

[19] A. J. Robinson, “An application of recurrent nets to phone probability
estimation,” Neural Networks, IEEE Transactions on, vol. 5, no. 2,
pp. 298–305, 1994.

[20] T. Mikolov, M. Karafi´at, L. Burget, J. Cernock`y, and S. Khudanpur,
“Recurrent neural network based language model.,” in INTERSPEECH,
vol. 2, p. 3, 2010.

[21] A. Graves, “Sequence transduction with recurrent neural networks,”
_arXiv preprint arXiv:1211.3711, 2012._

[22] Y. Bengio, N. Boulanger-Lewandowski, and R. Pascanu, “Advances in
optimizing recurrent networks,” in Acoustics, Speech and Signal Pro_cessing (ICASSP), 2013 IEEE International Conference on, pp. 8624–_
8628, IEEE, 2013.

[23] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
_computation, vol. 9, no. 8, pp. 1735–1780, 1997._

[24] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget:
Continual prediction with LSTM,” Neural computation, vol. 12, no. 10,
pp. 2451–2471, 2000.

[25] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, “Learning precise
timing with LSTM recurrent networks,” J. Machine Learning Research,
vol. 3, pp. 115–143, 2003.

[26] “Does Alexa have a list of its top-ranked websites?” [https://support.alexa.com/hc/en-us/articles/](https://support.alexa.com/hc/en-us/articles/200449834-Does-Alexa-have-a-list-of-its-top-ranked-websites-)
[200449834-Does-Alexa-have-a-list-of-its-top-ranked-websites-.](https://support.alexa.com/hc/en-us/articles/200449834-Does-Alexa-have-a-list-of-its-top-ranked-websites-)
Accessed: 2016-04-06.

[27] [“Bambenek consulting - master feeds.” http://osint.bambenekconsulting.](http://osint.bambenekconsulting.com/feeds/)
[com/feeds/. Accessed: 2016-04-06.](http://osint.bambenekconsulting.com/feeds/)

[28] P. A. Porras, H. Sa¨ıdi, and V. Yegneswaran, “A foray into conficker’s
logic and rendezvous points.,” in LEET, 2009.

[29] D. Andriesse, C. Rossow, B. Stone-Gross, D. Plohmann, and H. Bos,
“Highly resilient peer-to-peer botnets are here: An analysis of
gameover zeus,” in Malicious and Unwanted Software:” The Ameri_cas”(MALWARE), 2013 8th International Conference on, pp. 116–123,_
IEEE, 2013.

[30] [P. Royal, “On the kraken and bobax botnets.” https://www.damballa.](https://www.damballa.com/downloads/r_pubs/Kraken_Response.pdf)
[com/downloads/r pubs/Kraken Response.pdf, 2008.](https://www.damballa.com/downloads/r_pubs/Kraken_Response.pdf)

[31] B. Stone-Gross, M. Cova, L. Cavallaro, B. Gilbert, M. Szydlowski,
R. Kemmerer, C. Kruegel, and G. Vigna, “Your botnet is my botnet:
analysis of a botnet takeover,” in Proceedings of the 16th ACM
_conference on Computer and communications security, pp. 635–647,_
ACM, 2009.

[32] A. Karpathy, J. Johnson, and F.-F. Li, “Visualizing and understanding
recurrent networks,” in to appear in Proceedings of the Interna_tional Conference on Learning Representations, 2016. arXiv preprint_
arXiv:1506.02078.


-----