### Proceedings of Student-Faculty Research Day, CSIS, Pace University, May 5[th], 2017 ## Detecting Algorithmically Generated Domains Using Data Visualization and N-Grams Methods #### Tianyu Wang and Li-Chiou Chen Seidenberg School of CSIS, Pace University, Pleasantville, New York {tianyu.wang, lchen}@pace.edu **_Abstract— Recent Botnets such as Kraken, Torpig and Nugache_** **have used DNS based “domain fluxing” for command-and-** **control, where each bot queries for existence of a series of domain** **names and the owner has to register such domain name.** **Botmasters have begun employing domain generation algorithms** **(DGA) to dynamically produce a large number of random** **domains and select a small subset for actual use so that static** **domain lists ineffective. This article is to detect machine generated** **domain names; we tested common methods in classification on text** **strings of domain names has low accuracy. We introduced new** **features based on N-Grams in the classification methods and our** **experimental results show that the analysis of N-Gram methods** **can make a great progress in the accuracy of detection.** **_Index Terms— Classification Algorithms, Domain Name_** **System, Network Security, Visualization** I. INTRODUCTION any botnet detection systems use a blacklist of command-and-control (C&C) domains to detect bots and # M block their traffic. As a response, botmasters have begun employing domain generation algorithms (DGA) to dynamically produce a large number of random domains and select a small subset for actual use so that static domain lists ineffective. DGA is to be deterministic, yet generate a huge number of random domains so that bot maintainer only has to register one or few to enable the malware to work. There is a trend that more recent botnets have used DNS based “domain fluxing” for command-and-control, where each bot queries for existence of a series of domain names, such as Conficker, Kraken and Torpig. This method is called DNS “domain fluxing”, which means each bot algorithmically generates a large set of domain names and queries each of them until one of them is resolved and then the bot contacts the corresponding IP-address obtained that is typically used to host the command-and-control (C&C) server [1] [2]. Besides, for command-and-control, spammers also routinely generate random domain names in order to avoid detection [3]. This paper use the data from Alexa ranking list and DataDrivenSecurity dga dataset [20, 21]. Tianyu Wang is now a PhD candidate with the Department of Computer Science, Pace University, 861 Bedford Rd, Pleasantville, NY 10570 (e-mail: tianyu.wang@pace.edu). DGA stands for Domain Generating Algorithm and these algorithms are part of the evolution of malware communications. In the beginning, malware would be hardcoded with IP address or domain names and the botnet could be disrupted by going after whatever was hardcoded. The purpose of the DGA is to be deterministic, of which the bot maintainer only has to register one to enable the malware to phone home [4] [5]. If the domain or IP is taken down, the botnet maintainer with a new IP address can use a new name from the algorithm and the botnet maintained. Another major use case of detecting DGA is to protect non-authorized DNS servers, such as LDNS/ROOT-DNS. The purpose of building a DGA classifier is not to take down botnets, but to discover and detect the use on our network or services. Furthermore, if we are able to have a list of domains resolved and accessed at one’s organization, it is possible to see which of those are potentially generated and used by malware. This paper is organized as flows. In section 2, we discuss the background of domain names system and related security issues. We provide literature review in section 3. The DGA detection is presented in Section 4. We conclude the paper with our further research plan in section 5. II. BACKGROUND _A._ _The Domain Name System_ The Domain Name System (DNS) is a core component of Internet operation. It ensures the finding of any resource on the internet by just knowing the domain names of URL that is an easy way to remember. _B._ _Domain Name Space_ The naming system on which DNS is based is a hierarchical and logical tree structure called the domain namespace. Organizations can also create private networks that are not visible on the Internet, using their own domain namespaces. As the following figure shows, the root of the domain name space is the “.” Node. The following figure shows a subtree of the domain name space and the path to the root. Every node is Li-Chiou, Chen is the professor with the Department of Information System, School of Computer Science and Information Systems, Pace University, 861 Bedford Rd, Pleasantville, NY 10570 (e-mail: lchen@pace.edu). ----- called a level domain. Node at the base of the tree is called first level domains or Top Level Domains (TLD), for example, “edu”. Under the hierarchy, nodes are called second level domains (2LD), for example “email”, third level domains (3LD), etc. Figure 1. Domain Name Space Hierarchy. _C._ _DNS Related Security Issues_ DNS is often used to hide other kind of network traffic through the Internet. More specifically, there are many different DNS based misuse and malicious activities and related solving methods. _1)_ _DNS Fluxing_ DNS fluxing is a series of activity that enhance the availability and resilience of malicious resources and contents by hiding the real location of a given resources within a network. The hidden resource is a server that delivers malware, phishing website or command and control server of a botnet (C&C). Fast flux is one of the most common used DNS fluxing technique. It is used by botnets to hide phishing and malware delivery sites behind an ever-changing network of compromised hosts acting as proxies. It can also refer to the combination of peer-to-peer networking, distributed command and control, web-based load balancing and proxy redirection used to make malware networks more resistant to discovery and counter-measures. The Storm Worm (2007) is one of the first malware variants to make use of this technique [19]. The basic idea behind Fast flux is to have numerous IP addresses associated with a single fully qualified domain name, where the IP addresses are swapped in and out with extremely high frequency, through changing DNS records. _2)_ _Botnets_ A botnet is a number of Internet-connected devices used by a botnet owner to perform various tasks. These botnets are groups of malware machines or bots that could be remotely controlled by botmasters. Botnets can be used to perform Distributed Denial of Service (DDoS) attack, steal data, send spam, and allow the attacker access to the device and its connection. The owner can control the botnet using command and control (C&C) software. Botnets have become the main platform for cyber criminals to send spam, phishing and steal information, etc. Most of botnets rely on a centralized server (C&C). Bot could query a predefined C&C domain names that resolves IP address of server that malware commands will be received. Nowadays, in order to overcome the limitation that one single failure of C&C server is taken down, the botmaster would lose control over the botnet, C&C server have used P2P based structures in botnets, such as Storm, Zeus and Nugache [16, 17, 18]. To maintain a centralized P2P-based structure, attacker have developed a number of botnet that locate their server through algorithms generated random domain names. The related algorithm is called domain generation algorithms (DGA). _3)_ _Domain Generation Algorithms (DGA)_ Domain Generation Algorithms (DGA) is a series of algorithm that automatically generated domains names by given a random seed and then generate a list of candidate C&C domains. The botnet attempts to resolve these domains by sending DNS queries until one of the domains resolves to the IP address of a C&C server. This method introduces a convenient way to keep attacking resilience because if one domain names are identified and taken down, the bot will eventually get the valid IP address and using DNS queries to the next DGA domains. For example, Kraken and Conficker are some example of DGA-based botnets. _4)_ _DNS Monitoring_ DNS service is widely used as a core service of the whole Internet. Monitoring the DNS traffic performs an important role. Globally the technique to identify flux networks and botnets using DNS analysis have been proved efficient. However, these techniques require previous know about fluxing domain names, since it rely on classification algorithms that need training on truth data. Another issue is these techniques require large amount of DNS replies from different locations so that to compute relevant features to train classification algorithms is not easy. The time taken by these methods to identify flux networks is too long. Finally, DNS based techniques for bot infected host detestation are involved with privacy concerns. III. RELATED WORK Characteristics, such as IP addresses whose records and lexical features of phishing and non-phishing URLs have been analyzed by McGrath and Gupta [10]. They observed that the different URLs exhibited different alphabet distributions. Our work builds on this earlier work and develops techniques for identifying domains employing algorithmically generated names, potentially for “domain fluxing”. Ma, et al [9], employ statistical learning techniques based on lexical features (length of domain names, host names, number of dots in the URL etc.) and other features of URLs to automatically determine if a URL is malicious, i.e., used for phishing or advertising spam. While they classify each URL independently, our work is focused on classifying a group of URLs as algorithmically generated or not, solely by making use of the set of alphanumeric characters used. In addition, we experimentally compare against their lexical features in Section V and show that our alphanumeric distribution based features can detect algorithmically generated domain names with lower false positives than lexical features. Overall, we consider our work as complimentary and synergistic to the approach in [8]. The authors [13] develop a machine learning technique to classify individual domain names based on their network features, domain-name string composition style and presence in known reference lists. Their technique, however, relies on successful resolution of DNS domain name query. Our technique instead, ----- can analyze groups of domain names, based only on alphanumeric character features. With reference to the practice of “IP fast fluxing”, e.g., where the botnet owner constantly keeps changing the IP-addresses mapped to a C&C server, [12] implements a detection mechanism based on passive DNS traffic analysis. In our work, we present a methodology to detect cases where botnet owners may use a combination of both domain fluxing with IP fluxing, by having bots query a series of domain names and at the same time map a few of those domain names to an evolving set of IPaddresses. In addition, earlier papers [11], [8] have analyzed the inner working of IP fast flux networks for hiding spam and fraud infrastructure. With regards to botnet detection, [6], [7] perform correlation of network activity in time and space at campus network edges, and Xie et al in [14] focus on detecting spamming botnets by developing regular expression based signatures for spam URLs. M. Antonakakis present a new technique to detect randomly generated domains that most of the DGA-generated domains would result in Non-Existent Domain responses, and that bots from the same bot-net would generate similar NXDomain traffic [15]. IV. DGA DETECTION _A._ _Detection System_ Classification in machine learning would help in DGA domains detection. The purpose of building a DGA classifier is not to remove botnets, but to discover and detect the use on our network or services. Furthermore, if we can have a list of domains resolved and accessed at one’s organization, it is possible to see whether there are potentially generated and used by malware. Domain names are a series of text string, consisting of alphabet, numbers and dash sign. Therefore, it is common to use several supervised approaches to identify domains. Thus, the first step in any classifier is getting enough labeled training data. All we need is a list of legitimate domains and a list of domains generated by an algorithm. _B._ _Data Sets_ _1)_ _Alexa Domains_ For legitimate domains, an obvious choice is the Alexa list of top web sites. The Alexa Top Sites web service provides access to lists of web sites ordered by Alexa Traffic Rank. Using the web service developers can understand traffic rankings from the largest to the smallest sites. Alexa’s traffic estimates and ranks are based on the browsing behavior of people in our global data panel, which is a sample of all internet users. Alexa’s Traffic Ranks are based on the traffic data provided by users in Alexa’s global data panel over a rolling 3-month period. Traffic Ranks are updated daily. A site’s ranking is based on a combined measure of Unique Visitors and Page views. The number of unique Alexa users who visit a site on a given day determines unique Visitors. Page views are the total number of Alexa user URL requests for a site. However, multiple requests for the same URL on the same day by the same user are counted as a single Page view. The site with the highest combination of unique visitors and page views is ranked #1 [20]. However, the raw data grab from 1 Million Alexa domains are not ready for use. After we grab the top 1 Million Alexa domains (1,000,000 entries), we find that over 10 thousand are not domains but full URLs, and there are thousands of domains with subdomains that will not help. Therefore, after removing the invalid URL and subdomain and duplicated domains, we could have the clean Alexa data with 875,216 entries. In this article, we only concentrate on the domains without top level. For example, www.google.com, we only use google as domain. Table 1. First 5 Entries of Alexa data domain 0 google 1 facebook 2 youtube 3 yahoo 4 baidu It is important to shuffle the data randomly for training/testing purpose and sample only 90% of total data. In addition, we put label for this Alexa dataset as ‘legit’. The number of Alexa domains: 787,694 out of the total Alexa domains 875,216. _2)_ _DGA Domains_ On DataDrivenSecurity website, it provides file of domains and a high-level classification of “dga” or “legit” along with a subclass of either “legit”, “cryptolocker”, “goz” or “newgoz” [21]. These dga data are from recent botnets: “Cryptolocker”, two separate “Game-Over Zeus” algorithms, and an anonymous collection of algorithmically generated domains. Here we also resample 90% of the total data. Specifically, there are 47,398 out of 52,665 entries of algorithmically generated domains in our experiment. Here we also use domain names that without top-level parts. Table 2. First 5 entries of dga domain domain class 0 1002n0q11m17h017r1shexghfqf dga 1 1002ra86698fjpgqke1cdvbk5 dga 2 1008bnt1iekzdt1fqjb76pijxhr dga 3 100f3a11ckgv438fpjz91idu2ag dga 4 100fjpj1yk5l751n4g9p01bgkmaf dga _C._ _Basic Statistical Features_ Now we need to implement some features to measure domain names. The domain field here means second-level domain only. In the following article, we use domains for abbreviation. The class field is binary category, either dga or legit. DGA stands for dynamic generated algorithms domain, and legit stands for legitimate domains. _1)_ _Length_ First, we calculate the length of each domain. In the meantime, we drop those lengths that are less and equal to six, because for short domains, it is better use blacklist to filter out dga domains. |Col1|domain| |---|---| |0|google| |1|facebook| |2|youtube| |3|yahoo| |4|baidu| |Col1|domain|class| |---|---|---| |0|1002n0q11m17h017r1shexghfqf|dga| |1|1002ra86698fjpgqke1cdvbk5|dga| |2|1008bnt1iekzdt1fqjb76pijxhr|dga| |3|100f3a11ckgv438fpjz91idu2ag|dga| |4|100fjpj1yk5l751n4g9p01bgkmaf|dga| ----- _2)_ _Entropy_ Another feature is entropy of domain. In information theory, systems consist of a transmitter, channel, and receiver. The transmitter produces messages that are sent through the channel. The channel modifies the message in some way. The receiver attempts to infer which message was sent. In this context, entropy (more specifically, Shannon entropy) is the expected value (average) of the information contained in each message. This feature computes the entropy of character distribution and measure the randomness of each domain names. The entropy can explicitly be written as 𝒏𝒏 𝒏𝒏 𝑯(𝑿𝑯 𝑿) = �𝑷𝑷(𝒙𝒙𝒊𝒊)𝑰𝑰(𝒙𝒙𝒊𝒊) = −�𝑷𝑷(𝒙𝒙𝒊𝒊)𝒍𝒍𝒍𝒍𝒍𝒃𝒍 𝒃𝑷𝑷(𝒙𝒙𝒊𝒊) 𝒊𝒊=𝟏 𝒊=𝟏𝒊 Table 3. Sampling first 5 entries with length and entropy domain class length entropy 0 uchoten-anime legit 13 3.392747 1 photoprostudio legit 14 2.950212 5 andhraboxoffice legit 15 3.506891 6 kodama-tec legit 10 3.121928 7 porntubster legit 11 3.095795 _D._ _Data Visualization_ Before we begin our machine learning training, we plot scatter chart the check whether there is any correlation among the features. Figure 2. Scatter Plot: Domain Entropy vs Domain Length In this figure, we found that legit domain and DGA domain are overlapped together. When domain length is approximately equal to four, DGA has a trend that has a higher entropy than Legit. _E._ _Classification with Two Features_ The next step is to run several classification methods use these two features (length, entropy). There are 787k legit and 47k DGA domains, so we use 80/20 split techniques for our training set and testing set. We choose to use three common supervised classification methods. Random Forest, Support Vector Machines (SVM) and Naïve Bayes. Hypothesis: - Positive: domain is dga - Negative: domain is non-dga, in other words, legitimate domain _1)_ _Using Random Forest Classifier_ Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set _a)_ _Random Forest Algorithms_ A forest is the average of the predictions of its trees: 𝐽𝐽 𝐹𝐹(𝑥𝑥) = [1] 𝐽𝐽 [�𝑓𝑓][𝑖𝑖][(𝑥𝑥)] 𝑗𝑗=1 𝑤𝑤ℎ𝑒𝑒𝑒𝑒𝑒𝑒 𝐽𝐽 𝑖𝑖𝑖𝑖 𝑡𝑡ℎ𝑒𝑒 𝑛𝑛𝑢𝑢𝑢𝑢𝑢𝑢𝑢𝑢𝑢𝑢 𝑜𝑜𝑜𝑜 𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 𝑖𝑖𝑖𝑖 𝑡𝑡ℎ𝑒𝑒 𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓 For a forest, the prediction is simply the average of the bias terms plus the average contribution of each feature: 𝐽𝐽 𝐾𝐾 𝐽𝐽 𝐹𝐹(𝑥𝑥) = [1] + �([1] 𝐽𝐽 [�𝑐][𝑗𝑗 𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓][𝑐] 𝐽𝐽 [�𝑐][𝑐][𝑐][𝑐][𝑐][𝑐][𝑐] [𝑐][𝑐] [𝑐][𝑐] [𝑐][𝑐][𝑗𝑗][𝑐] [(𝑥][𝑥][, 𝑘][𝑘][))] 𝑗𝑗=1 𝑘𝑘=1 𝑗𝑗=1 _b)_ _Classifier Paramteres_ Parameters Values The number of features (N) 2 The number of trees in the forest (n) 100 The number of features for the best split √𝑁𝑁 The minimum number of samples to split 2 The minimum number of samples at a leaf node 1 _c)_ _Classification Results_ Predicted dga legit All True dga 2991 6379 9370 legit 427 127532 127959 All 3418 133911 137329 True Positive Rate (TPR) = 31.92% False Negative Rate (FNR) = 68.08% False Positive Rate (FPR) = 0.33% True Negative Rate (TNR) = 99.67% False Acceptance Rate (FAR) = 4.76% False Rejection Rate (FRR) = 12.49% The confusion matrix shows how our model predicts in classification using random forest classifier. The row is the true label, either dga or legit. The column is what our model predicted. Both the row and column has a total field indicate our sample size. The model performs not well. It identified dga domain as dga with only 31.92% accuracy (true positive rate). It misclassified dga domain as legit domain with 68.08% accuracy (false negative rate). Even it has a good prediction on true positive rate, which is 99.67%, the overall results in a biometric system is not good. False acceptance rate is 4.76% and false rejection rate is 12.48%. Therefore, the result of this method is not meet our requirement. _2)_ _Using SVM Classifier_ _a)_ _SVM Algorithms_ Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm |Col1|domain|class|length|entropy| |---|---|---|---|---| |0|uchoten-anime|legit|13|3.392747| |1|photoprostudio|legit|14|2.950212| |5|andhraboxoffice|legit|15|3.506891| |6|kodama-tec|legit|10|3.121928| |7|porntubster|legit|11|3.095795| |b) Classifier Paramteres|Col2| |---|---| |Parameters|Values| |The number of features (N)|2| |The number of trees in the forest (n)|100| |The number of features for the best split|√𝑁𝑁| |The minimum number of samples to split|2| |The minimum number of samples at a leaf node|1| |Predicted|dga|legit|All| |---|---|---|---| |True|||| |dga|2991|6379|9370| |legit|427|127532|127959| |All|3418|133911|137329| ----- builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. _b)_ _Classifier Parameters_ Parameters Value Kernel Linear Penalty parameter C of the error term 1 _c)_ _Classification Result_ Predicted dga legit All True dga 1160 8210 9370 legit 105 127854 127959 All 1265 136064 137329 TPR FNR FPR TNR FAR FRR 12.38% 87.62% 0.08% 99.92% 6.03% 8.30% The confusion matrix indicates how our model predicts in classification using SVM classifier. The row is the true label, either dga or legit. The column is what our model predicted. Both the row and column has a total field indicate our sample size. The model performs not well. It identified dga domain as dga with only 12.38% accuracy (true positive rate). It misclassified dga domain as legit domain with 87.62% accuracy (false negative rate). Even it has a good prediction on true positive rate, which is 99.67%, the overall results in a biometric system is not good. False acceptance rate is 6.03% and false rejection rate is 8.30%. Therefore, this method failed in classification. _3)_ _Using Naïve Bayes Classifier_ _a)_ _Naïve Bayes Algorithms_ 𝑃𝑃(𝑐𝑐|𝑥𝑥) = [𝑃][𝑃][(𝑥][𝑥][|𝑐][𝑐][)𝑃][𝑃][(𝑐][𝑐][)] 𝑃𝑃(𝑥𝑥) 𝑤𝑤ℎ𝑒𝑒𝑒𝑒𝑒𝑒 𝑃𝑃(𝑐𝑐|𝑋𝑋) = 𝑃𝑃(𝑥𝑥1) × 𝑃𝑃(𝑥𝑥2) … 𝑃𝑃(𝑥𝑥𝑛𝑛) × 𝑃𝑃(𝑐𝑐) - 𝑃𝑃(𝑐𝑐|𝑋𝑋) is the posterior probability of class (c, target) given predictor (x, metric features) - 𝑃𝑃(𝑐𝑐) is the prior probability of class - 𝑃𝑃(𝑥𝑥|𝑐𝑐) is the likelihood which is the probability of predictor given class - 𝑃𝑃(𝑥𝑥) is the prior probability of predictor - Naïve Bayes has no parameters to tune _b)_ _Classification Result_ Predicted dga legit All True dga 3332 6038 9370 legit 5061 122898 127959 All 8393 128936 137329 TPR FNR FPR TNR FAR FRR 35.56% 64.44% 3.96% 96.04% 4.68% 60.30% The confusion matrix indicates how our model predicts in classification using Naïve Bayes classifier. The row is the true label, either dga or legit. The column is what our model predicted. Both the row and column has a total field indicate our sample size. The model performs not well. It identified dga domain as dga with only 35.56% accuracy (true positive rate). It misclassified dga domain as legit domain with 64.44% accuracy (false negative rate). Even it has a good prediction on true positive rate, which is 96.04%, the overall results in a biometric system is not good. False acceptance rate is 4.68% and false rejection rate is as high as 60.30%. Therefore, the classifier predicts unsuccessful. Since these three models are not able to classify dga and legit domains successfully, we need to add more features to improve our model. _F._ _Model Improvement_ We notice that dga domain either uses some random characters as text string or uses a dictionary to make up a new text string. Therefore, we build up our own corpus for these features. _1)_ _NGram Features_ If a domain is a legit domain, it more likely exists in the Alexa ranking list. Thus, it is necessary to find the similarity of legit domains. We could use some text analysis techniques. The first step is to build up a legit text corpus. Given a subsequence of domains, we summarize the frequency distribution of Ngram among the Alexa domain name string with n = [3, 5]. We called it Alexa_grams matrix. _2)_ _Alexa Gram_ We calculate the similarity between every single domain and Alexa_grams matrix. In order to calculate the similarity, we use some matrix transformation techniques to sum up the frequency. Furthermore, we normalize the frequency by log10 as a similarity score. (See Table 5.) _3)_ _Dictionary Gram_ We use a dictionary that contains 479,623 common used word terms [22]. The terms are combination of English vocabulary and common used words with mix of number and alphabet. We will use a words dictionary. After basic cleaning up work, the following is some basic discretions about the dictionary. Similarly, we calculate the dictionary gram using N-gram, n = [3,5] and calculate the normalized similarity between words dictionary and every single domain. (See Table 5.) The reason why we choose n = 3, 4 and 5 is because we have tested n = [1,10] and found n = 3, 4, 5 have the best accuracy results. Table 4. First 5 entries of words dictionary word 37 a 48 aa 51 aaa 53 aaaa 54 aaaaaa Table 5. Sample of domain with Alexa grams and dictionary grams domain Alexa match Dict match google 23 14 facebook 42 27 |Classifier Parameters|Col2| |---|---| |Parameters|Value| |Kernel|Linear| |Penalty parameter C of the error term|1| |Predicted|dga|legit|All| |---|---|---|---| |True|||| |dga|1160|8210|9370| |legit|105|127854|127959| |All|1265|136064|137329| |TPR|FNR|FPR|TNR|FAR|FRR| |---|---|---|---|---|---| |12.38%|87.62%|0.08%|99.92%|6.03%|8.30%| |Predicted|dga|legit|All| |---|---|---|---| |True|||| |dga|3332|6038|9370| |legit|5061|122898|127959| |All|8393|128936|137329| |Col1|word| |---|---| |37|a| |48|aa| |51|aaa| |53|aaaa| |54|aaaaaa| |TPR|FNR|FPR|TNR|FAR|FRR| |---|---|---|---|---|---| |35.56%|64.44%|3.96%|96.04%|4.68%|60.30%| |domain|Alexa match|Dict match| |---|---|---| |google|23|14| |facebook|42|27| ----- |pterodactylfarts|53|76| |---|---|---| |ptes9dro- dwacty2lfa5rrts|30|28| Now, we compute N-Gram matches for all the domains and add to our data frame. Table 6. Calculated N-Gram for legit domains domain class alexa_grams word_grams investmentsonthebeach legit 144.721988 109.722683 infiniteskills legit 81.379156 72.785882 dticash legit 26.557931 23.710317 healthyliving legit 76.710198 61.721689 asset-cache legit 46.267887 31.690803 Table 7. Calculated N-Gram for dga domains domain class alexa_grams word_grams wdqdreklqnpp dga 11.242176 6.367475 wdqjkpltirjhtho dga 14.303602 16.554439 wdqxavemaedon dga 28.468264 28.699800 wdraokbcnspexm dga 25.935386 19.784933 wdsqfivqnqcbna dga 4.597991 3.629002 _4)_ _Data Visualization_ Here we plot scatter about whether our new 'alexa_grams' feature can help us differentiate between DGA and Legit domains. Figure 3. Scatter Plot: Alexa Gram vs Domain Length Figure 4. Scatter Plot: Alexa Gram vs Domain Entropy Here we want to see whether our new 'word_grams' feature can help us differentiate between Legit/DGA. Figure 5. Scatter Plot: Dictionary Gram vs Domain Length |domain|class|alexa_grams|word_grams| |---|---|---|---| |investmentsonthebeach|legit|144.721988|109.722683| |infiniteskills|legit|81.379156|72.785882| |dticash|legit|26.557931|23.710317| |healthyliving|legit|76.710198|61.721689| |asset-cache|legit|46.267887|31.690803| |domain|class|alexa_grams|word_grams| |---|---|---|---| |wdqdreklqnpp|dga|11.242176|6.367475| |wdqjkpltirjhtho|dga|14.303602|16.554439| |wdqxavemaedon|dga|28.468264|28.699800| |wdraokbcnspexm|dga|25.935386|19.784933| |wdsqfivqnqcbna|dga|4.597991|3.629002| |Predicted|dga|legit|All| |---|---|---|---| |True|||| |dga|9139|231|9370| |legit|254|127705|127959| |All|9393|127936|137329| |TPR|FNR|FPR|TNR|FAR|FRR| |---|---|---|---|---|---| |97.53%|2.47%|0.20%|99.80%|0.18%|2.70%| |Predicted|dga|legit|All| |---|---|---|---| |True|||| |dga|8623|747|9370| |legit|534|127425|127959| |All|9157|128172|137329| |TPR|FNR|FPR|TNR|FAR|FRR| |---|---|---|---|---|---| |92.03%|7.97%|0.42%|99.58%|0.58%|5.83%| The confusion matrix indicates how our model predicts in classification using SVM classifier. The row is the true label, either dga or legit. The column is what our model predicted. Both the row and column has a total field indicate our sample size. The model performs pretty well. It identified dga domain as dga with 92.03% accuracy (true positive rate). It misclassified dga domain as legit domain as low as 7.97% (false negative rate). It has a good prediction on true positive rate, Figure 6. Scatter Plot: Dictionary Gram vs Entropy After we add two extra features, the overlapped issue improved. We could have a clear view that legit, dga has their own clusters, and it is more reasonable to perform some classification methods once again. _5)_ _Classification with Four Feature_ Now we have four features in our model: Length, Entropy, Alexa_grams, and Dict_grams. We could use the same parameters tuning our classification model. _a)_ _Using Random Forest Classifier_ dga 9139 231 9370 legit 254 127705 127959 All 9393 127936 137329 The confusion matrix indicates how our model predicts in classification using random forest classifier. The row is the true label, either dga or legit. The column is what our model predicted. Both the row and column has a total field indicate our sample size. The model performs pretty well. It identified dga domain as dga with 97.53% accuracy (true positive rate). It misclassified dga domain as legit domain as low as 2.47% (false negative rate). It has a good prediction on true positive rate, which is 99.80%, It also has low false positive rate which is 0.20%. The overall results in a biometric system is good as well. False acceptance rate is 0.18% and false rejection rate is 2.70%. Therefore, this method succeeds in classification. _b)_ _Using SVM Classifier_ dga 8623 747 9370 legit 534 127425 127959 All 9157 128172 137329 ----- which is 99.80%, It also has low false positive rate which is 0.42%. The overall results in a biometric system is good as well. False acceptance rate is 0.58% and false rejection rate is 5.83%. Therefore, this method succeeds in classification. _c)_ _Using Naïve Bayes Classifier_ |Col1|Length|Entropy|Alexa_grams|Dict_grams| |---|---|---|---|---| |Score|0.2925341|0.21776668|0.36576691|0.1239323| dga 7203 2167 9370 legit 354 127605 127959 All 7557 129772 137329 |Predicted|dga|legit|All| |---|---|---|---| |True|||| |dga|7203|2167|9370| |legit|354|127605|127959| |All|7557|129772|137329| |TPR|FNR|FPR|TNR|FAR|FRR| |---|---|---|---|---|---| |76.87%|23.13%|0.28%|99.72%|1.67%|4.68%| The confusion matrix indicates how our model predicts in classification using Naïve Bayes classifier. The row is the true label, either dga or legit. The column is what our model predicted. Both the row and column has a total field indicate our sample size. The model performs pretty well. It identified dga domain as dga with only 76.87% accuracy (true positive rate). It misclassified dga domain as legit domain with 23.13% (false negative rate). It has a good prediction on true positive rate, which is 99.72%. It has low false positive rate, which is 0.28%. The overall results in a biometric system is not good. False acceptance rate is 1.67% and false rejection rate is 4.68%. Therefore, this method failed in classification. _6)_ _Model Comparisons_ Table 8. Model Comparisons Performance Rate Random Forest SVM Naïve Bayes TPR 97.53% 92.03% 76.87% FNR 2.47% 7.97% 23.13% FPR 0.20% 0.42% 0.28% TNR 99.80% 99.58% 99.72% FAR 0.18% 0.58% 1.67% FRR 2.70% 5.83% 4.68% For true positive, true negative rate, the higher the better, because it means more accurate on our prediction. For false positive rate, true negative rate, false acceptance rate and false rejection rate, the lower the better, because it means the type I and type II error rates. Among all three models, Random Forest classifier outperforms the best. The reason that random forest performs the best is because random forest is a multi-layer decision tree. It will subgroup every details of features in a tree structure. The domain is a series of text string, and a tree structure classifier very easily captures the specific features of text string. However, linear SVM is trying to draw several straight line between the features of data. The scatter plot shows that we still have overlapped data among all the features so that the accuracy of SVM is not as good as random forest. The Naïve Bayes is a combination of conditional probabilities, and a single gram is not effective among text string. We used this classifier as our prediction model. We also calculate the importance score on these four features. The importance of a feature is computed as normalized total reduction of the criterion brought by that feature. Table 9. Importance Score on Random Forest Length Entropy Alexa_grams Dict_grams Score 0.2925341 0.21776668 0.36576691 0.1239323 We found that the most important feature in our model is Alexa_grams. It indicates that Alexa ranking maintains a good contribution on dga classification. It proves our hypotheses that most of botnet masters are using dictionary or random characters to generate malicious domains. The second ranking is length of domain names followed by entropy and Dict_grams. It indicates that more and more botnet masters are using some English words dictionary as their algorithms input. Our methods could also detect dga that using dictionary. _7)_ _Misclassification_ _a)_ _Educational Institution Domains_ First, look at a piece of our prediction sample. The following table is an example of prediction using random forest as a classifier. It performs and predicts well except some university domain names. For example, tsinghua.edu.cn and sjtu.edu.cn are the domain names of university in China. Table 10. Prediction sample domain prediction google legit webmagnat.ro legit bikemastertool.com legit 1cb8a5f36f dga pterodactylfarts legit pybmvodrcmkwq.biz dga abuliyan.com legit bey666on4ce dga sjtu.edu.cn dga tsinghua.edu.cn dga Table 11. Misclassification sample domain length entropy alexa_gram word_gram predict duurzaamthuis 13 3.18083 20.353 17.785 legit hutkuzwropgf 12 3.4183 14.240 10.431 legit xn-ecki4eoz0157d 28 4.28039 37.036 15.577 legit hv1bosfom5c nllcolooxrycoy 14 2.61058 31.160 26.914 dga dktazhqlzsnorer 15 3.64022 24.592 22.804 legit eprqhtyhoplu 12 3.25163 24.762 19.213 dga domowe-wypieki 14 3.23593 28.051 24.537 legit taesdijrndsatw 14 3.23593 30.930 21.647 dga edarteprsytvhww 15 3.37356 36.684 29.358 dga ukonehloneybmfb 15 3.37356 39.44 36.303 dga ekgzkawofkxzlq 14 3.32486 7.0389 5.4897 legit For those legit domains but our model treat them as dga, some of legit domains come from foreigner countries. For example, domowe-wypieki comes from www.domowewypieki.com, which is a homemade pastries food website in polish. These countries use very different word and character system than those in English. In order to use English words in domain system, many of domains are adapted and made of some initial letters of approximately pronunciation of foreigner language. This is why some legit domain arise misclassification issue. |domain|prediction| |---|---| |google|legit| |webmagnat.ro|legit| |bikemastertool.com|legit| |1cb8a5f36f|dga| |pterodactylfarts|legit| |pybmvodrcmkwq.biz|dga| |abuliyan.com|legit| |bey666on4ce|dga| |sjtu.edu.cn|dga| |tsinghua.edu.cn|dga| |Performance Rate|Random Forest|SVM|Naïve Bayes| |---|---|---|---| |TPR|97.53%|92.03%|76.87%| |FNR|2.47%|7.97%|23.13%| |FPR|0.20%|0.42%|0.28%| |TNR|99.80%|99.58%|99.72%| |FAR|0.18%|0.58%|1.67%| |FRR|2.70%|5.83%|4.68%| |domain|length|entropy|alexa_gram|word_gram|predict| |---|---|---|---|---|---| |duurzaamthuis|13|3.18083|20.353|17.785|legit| |hutkuzwropgf|12|3.4183|14.240|10.431|legit| |xn-- ecki4eoz0157d hv1bosfom5c|28|4.28039|37.036|15.577|legit| |nllcolooxrycoy|14|2.61058|31.160|26.914|dga| |dktazhqlzsnorer|15|3.64022|24.592|22.804|legit| |eprqhtyhoplu|12|3.25163|24.762|19.213|dga| |domowe-wypieki|14|3.23593|28.051|24.537|legit| |taesdijrndsatw|14|3.23593|30.930|21.647|dga| |edarteprsytvhww|15|3.37356|36.684|29.358|dga| |ukonehloneybmfb|15|3.37356|39.44|36.303|dga| |ekgzkawofkxzlq|14|3.32486|7.0389|5.4897|legit| ----- For those dga domains but our model regards them as legit, probably because Alexa ranking only summarize the unique visiting volume. Thus, there are still so many malicious and dga domain are among Alexa dataset. _b)_ _Discussion_ There are some potential ways to address those issues above and improve our model. First, we could set up a filter to sort the top-level domain (TLD) on those education and non-profit domains. In addition, for those foreign websites, we would try to figure out how these domains works and find a better legit dataset, except for Alexa. We could also use other dictionary such as Wiki keywords as our classifier features. At last, we plan to build up a self-adapted machine learning architecture that could learn from real-time DNS traffic, detect, and prevent those anomaly activities in our future research. V. CONCLUSION AND DISCUSSION In this paper, we introduce the necessary about detection of DGA domains. In addition, we tested three common machine learning algorithms, random forest, SVM and Naïve Bayes, to classify legit and DGA domain names. We provide data visualization techniques with two new features, Alexa gram and Dictionary gram in classification experiment. At last, we found introducing NGram features would increase the accuracy of classification models and random forest classifier performs the best among all. We also found some issue using our methods and come up some ideas to solve the problem. We plan to improve our classification method and then setup our own DNS servers and build up two-engine network monitoring system. One is for machine learning training and model updating. The other one is for real-time monitoring for prevention. REFERENCES [1] S. Yadav, A. K. K. Reddy, A. L. N. Reddy, and S. Ranjan, “Detecting algorithmically generated malicious domain names,” presented at the the 10th annual conference, New York, New York, USA, 2010, pp. 48–61. [2] S. Yadav, A. K. K. Reddy, A. L. N. Reddy, and S. Ranjan, “Detecting algorithmically generated domain-flux attacks with DNS traffic analysis,” IEEE/ACM Transactions on Networking (TON, vol. 20, no. 5, Oct. 2012. [3] A. Reddy, “Detecting Networks Employing Algorithmically Generated Domain Names,” 2010. [4] Z. Wei-wei and G. Qian, “Detecting Machine Generated Domain Names Based on Morpheme Features,” 2013. [5] P. Barthakur, M. Dahal, and M. K. Ghose, “An Efficient Machine Learning Based Classification Scheme for Detecting Distributed Command & Control Traffic of P2P Botnets,” International Journal of Modern …, 2013. [6] G. Gu, R. Perdisci, J. Zhang, and W. Lee. BotMiner: Clustering Analysis of Network Traffic for Protocol- and Structure-independent Botnet Detection. Proceedings of the 17th USENIX Security Symposium (Security’08), 2008. [7] G. Gu, J. Zhang, and W. Lee. BotSniffer: Detecting Botnet Command and Control Channels in Network Traffic. Proc. of the 15th Annual Network and Distributed System Security Symposium (NDSS’08), Feb. 2008. [8] T. Holz, M. Steiner, F. Dahl, E. W. Biersack, and F. Freiling. Measurements and Mitigation of Peer-to-peer-based Botnets: A Case Study on Storm Worm. In First Usenix Workshop on Large-scale Exploits and Emergent Threats (LEET), April 2008. [9] S. S. J. Ma, L.K. Saul and G. Voelker. Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs. Proc. of ACM KDD, July 2009. [10] D.K.McGrathandM.Gupta.BehindPhishing:AnExaminationofPhisher Modi Operandi. Proc. of USENIX workshop on Large-scale Exploits and Emergent Threats (LEET), Apr. 2008. [11] E. Passerini, R. Paleari, L. Martignoni, and D. Bruschi. Fluxor : Detecting and Monitoring Fast-flux Service Networks. Detection of Intrusions and Malware, and Vulnerability Assessment, 2008. [12] R. Perdisci, I. Corona, D. Dagon, and W. Lee. Detecting Malicious Flux Service Networks Through Passive Analysis of Recursive DNS Traces. In Annual Computer Society Security Applications Conference (ACSAC), dec 2009. [13] M. Antonakakis, R. Perdisci, D. Dagon,W. Lee, and N. Feamster. Building a Dynamic Reputation System for DNS. In USENIX Security Symposium,2010. [14] Y. Xie, F. Yu, K. Achan, R. Panigrahy, G. Hulten, and I. Osipkov. Spamming Botnets: Signatures and Characteristics. ACM SIGCOMM Computer. [15] Manos Antonakakis, Roberto Perdisci, Yacin Nadji, Nikolaos Vasiloglou, Saeed Abu-Nimeh, Wenke Lee, and David Dagon. 2012. From throwaway traffic to bots: detecting the rise of DGA-based malware. In Proceedings _of_ _the_ _21st_ _USENIX_ _conference_ _on_ _Security_ _symposium (Security'12). USENIX Association, Berkeley, CA, USA, 24-_ 24. [16] ZeuS Gets More Sophisticated Using P2P Techniques. http://www.abuse.ch/?p=3499, 2011 [17] S. Stover, D. Dittrich, J. Hernandez, and S. Dietrich. Analysis of the storm and nugache trojans: P2P is here. In _USENIX; login:, vol. 32, no. 6,_ December 2007. [18] Wikipedia. The storm botnet. http://en.wikipedia.org/wiki/Storm_botnet. [19] Prince, Brian (January 26, 2007). "'Storm Worm' Continues to Spread Around Globe". FOXNews.com. Retrieved 2007-01-27. [20] Alexa ranking, https://aws.amazon.com/alexa-top-sites/ [21] Dataset collection, http://datadrivensecurity.info/blog/pages/dds-datasetcollection.html [22] Data hacking, http://clicksecurity.github.io/data_hacking/ -----