{
	"id": "4f48dd35-e5c1-4640-bc2e-3f2c23dc5877",
	"created_at": "2026-04-06T01:32:28.812709Z",
	"updated_at": "2026-04-10T03:20:39.528609Z",
	"deleted_at": null,
	"sha1_hash": "1b047930461f52c89241fbf387be9e367d6ac3c9",
	"title": "Building a DGA Classifier: Part 2, Feature Engineering",
	"llm_title": "",
	"authors": "",
	"file_creation_date": "0001-01-01T00:00:00Z",
	"file_modification_date": "0001-01-01T00:00:00Z",
	"file_size": 202837,
	"plain_text": "Building a DGA Classifier: Part 2, Feature Engineering\r\nBy “Jay Jacobs (@jayjacobs)\r\nPublished: 2014-10-02 · Archived: 2026-04-06 01:18:04 UTC\r\nBuilding a DGA Classifier: Part 2, Feature Engineering\r\nBy “Jay Jacobs (@jayjacobs)\"\r\nThu 02 October 2014 | tags: blog, r, rstats, -- (permalink)\r\nThis is part two of a three-part blog series on building a DGA classifier and it is split into the three phases of\r\nbuilding a classifier: 1) Data preperation 2) Feature engineering and 3) Model selection.\r\nBack in part 1, we prepared the data and we are starting with a nice clean list of domains labeled as either\r\nlegitamate (“legit”) or generated by an algorithm (“dga”).\r\nlibrary(dga)\r\ndata(sampledga)\r\nIn any machine learning approach, you will want to construct a set of “features” that help describe each class or\r\noutcome you are attempting to predict. Now the challenge with selecting features is that different models have\r\ndifferent assumptions and restrictions on the type of data fed into them. For example, a linear regression model is\r\nvery picky about correlated features while a random forest model will handle those without much of a hiccup. But\r\nthat’s something we’ll have to face in when we are selecting a model. For now, we will want to gather up all the\r\nfeatures we can think of (and have time for) and then we can sort them out in the final model.\r\nIn our case, all we have to go off of is a domain name: a string of letters, numbers and maybe a dash. We have to\r\nthink of what makes the domains generated by an algorithm that much different from a normal domain. 
For example, in the Click Security example they calculate the following features for each domain:\r\nLength in characters\r\nEntropy (range of characters)\r\nn-grams (3,4,5) and the “distance” from the n-grams of known legit domains\r\nn-grams (3,4,5) and the “distance” from the n-grams of dictionary words\r\ndifference between the two distance calculations\r\nThere is an almost endless list of other features you could come up with beyond those:\r\nratio of numbers to (length/vowels|consonants/non-numbers)\r\nratio of vowels to (length/numbers/etc)\r\nproportion matching dictionary words\r\nlargest dictionary word match\r\nall the combinations of n-grams (mentioned above)\r\nMarkov chain of probable combinations of characters\r\nSimplicity is the name of the game\r\nI know it may seem a bit counter-intuitive, but simplicity is the name of the game when doing feature selection. At first thought, you may think you should try every feature (and combination of features) so you can build the very best model you can, but there are many reasons not to do that. First, no model or algorithm is going to be perfect, and the more robust solutions will employ a variety of approaches (not just a single algorithm), so striving for perfection has diminishing returns.\r\nSecond, adding too many features may cause you to overfit to your training data. That means you could build a model that appears to be very accurate in your tests, but stinks with any new data (or new domain generating algorithms, in our case). Finally, every feature will take some level of time and effort to generate and process, and these add up quickly. The end result is that you should have just enough features to be helpful, and no more than that. 
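Several of the cheap candidate features above (length, entropy, digit and vowel ratios) take only a few lines to compute. Here is a minimal sketch, in Python purely for illustration (the series itself uses R), assuming “entropy” means Shannon entropy over the character distribution:

```python
from collections import Counter
from math import log2

def simple_features(domain):
    # length, character entropy, and digit/vowel ratios for one domain
    n = len(domain)
    counts = Counter(domain)
    entropy = -sum((c / n) * log2(c / n) for c in counts.values())
    digits = sum(ch.isdigit() for ch in domain)
    vowels = sum(ch in 'aeiou' for ch in domain)
    return {'length': n, 'entropy': entropy,
            'digit_ratio': digits / n, 'vowel_ratio': vowels / n}

print(simple_features('host55'))
print(simple_features('kykwdvibps'))
```

Run on “host55” versus “kykwdvibps”, the digit ratio and entropy already begin to separate human-styled names from random-looking ones.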
The Click Security example, in my opinion, does an excellent job at this balance with just a handful of features.\r\nAlso, there isn’t any exact science to selecting features, so get the notion that science is structured, clean and orderly right out of your head. Feature selection will, at least at this stage, rely heavily on domain expertise. As we get to model selection, we will be weeding out variables that don’t help, are too slow or contradict the model we are testing.\r\nFor now, think of what makes a domain name usable. For example, when people create a domain name they focus on readability, so they may include one set of digits together (“host55”) but rarely would they do “h5os5t”; perhaps looking at number usage could be good. Or you could assume that randomly selecting from 26 characters and 10 numbers will create some very strange combinations of characters not typically found in a language. Therefore, in legitimate domains, you expect to see more combinations like “est” and fewer like “0zq”. The task when doing feature selection is to find attributes that indicate the difference. Just as in real life, where a good feature for classifying a car from a motorcycle may be the number of tires on the road, you want to find measurable attributes that separate legitimate domains from those generated by an algorithm.\r\nN-Grams\r\nI hinted at n-grams in the previous paragraph, and they may be a little difficult to grasp if you’ve never thought about them. They are built on the premise that there are frequent character patterns in natural language. Anyone who’s watched “Wheel of Fortune” knows that consonants like r, s, t, n and l appear a lot more often than m, w and q, and nobody guesses “q” as their first-choice letter. One of the features you could include is a simple count of characters like that (it could be called a “1-gram”, an “n-gram of 1”, a “unigram” or simply “character frequency” since it’s single characters). 
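Counting those single characters is straightforward; a quick sketch of the unigram idea (in Python for illustration only, since the blog’s own code is R):

```python
from collections import Counter

# Character ('1-gram') frequencies across a list of domains; a toy
# illustration of the unigram feature described above.
def char_freq(domains):
    counts = Counter()
    for d in domains:
        counts.update(d)
    return counts

freq = char_freq(['facebook', 'google', 'youtube'])
print(freq.most_common(3))
```

Natural-language names pile counts onto common letters like “o” and “e”, while algorithmically generated strings spread their counts far more evenly.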
Randomly generated domains would have a much different distribution of characters than those based on natural language. That difference should help an algorithm correctly classify between the two.\r\nBut you can get fancier than that and look at the frequency of combinations of characters; the “n” in “n-grams” represents a variable length. You could look for combinations of 3 characters, so let’s take a look at how that works with the stringdist package and the qgrams() function.\r\nlibrary(stringdist)\r\nqgrams(\"facebook\", q=3)\r\n## fac ook ace ceb ebo boo\r\n## V1 1 1 1 1 1 1\r\nqgrams(\"sandbandcandy\", q=3)\r\n## san and ndb ndc ndy dba dca ban can\r\n## V1 1 3 1 1 1 1 1 1 1\r\nqgrams(\"kykwdvibps\", q=3)\r\n## kyk ykw wdv vib kwd dvi ibp bps\r\n## V1 1 1 1 1 1 1 1 1\r\nSee how the function pulls out groups of 3 characters that appear contiguously? Also, look at the difference in the collections of trigrams: the first two don’t look too weird, but the output from kykwdvibps probably doesn’t match the character combinations you are used to in the English language. That is what we want to capitalize on. 
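One plausible way to turn that observation into a single number (an assumption on my part; the dga package’s getngram, used below, has its own scoring) is the fraction of a domain’s trigrams that appear in a reference table built from legit domains. A Python sketch with a tiny, hypothetical corpus standing in for the real list of legit domains from part 1:

```python
from collections import Counter

def trigrams(s):
    # contiguous 3-character slices, like qgrams(s, q=3)
    return [s[i:i+3] for i in range(len(s) - 2)]

# Reference counts from a tiny, hypothetical legit corpus; the real
# thing would use the ~5,000 legit domains prepared in part 1.
legit = ['facebook', 'google', 'youtube', 'wikipedia', 'baidu']
ref = Counter(g for d in legit for g in trigrams(d))

def ngram_score(ref, domain):
    # fraction of this domain's trigrams seen in the reference table
    grams = trigrams(domain)
    return sum(g in ref for g in grams) / len(grams)

print(ngram_score(ref, 'facebook'))    # every trigram is in the table
print(ngram_score(ref, 'kykwdvibps'))  # none of them are
```

A high score means the string is built from character runs the legit corpus has seen before; random strings score near zero.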
All we have to do is teach the algorithm everything about the English language. Easy, right? Actually, we just have to teach it what should be “expected” as far as character combinations go, and we can do that by figuring out which n-grams appear in legitimate domains and then calculating the difference.\r\n# pull domains where class is \"legit\"\r\nlegitgram3 \u003c- qgrams(sampledga$domain[sampledga$class==\"legit\"], q=3)\r\n# what's at the top?\r\nlegitgram3[1, head(order(-legitgram3), 10), drop=F]\r\n## ing ter ine the lin ion est ent ers and\r\n## V1 161 138 130 113 111 106 103 102 100 93\r\nNotice that we have over 7,000 trigrams here, with many of them appearing in a very small proportion; let’s clean those up so the oddities/outliers don’t throw off the training. We have 5,000 legit domains, so we should cut off the infrequent occurrences, and we could experiment with what that cutoff should be. Let’s create the n-grams of length 1, 2, 3, 4 and 5; I will use the ngram function in the dga package and recreate the 3-gram above. 
I’ll also include the n-gram of combined lengths 3, 4 and 5.\r\nlegitname \u003c- sampledga$domain[sampledga$class==\"legit\"]\r\nonegood \u003c- ngram(legitname, 1)\r\ntwogood \u003c- ngram(legitname, 2)\r\nthreegood \u003c- ngram(legitname, 3)\r\nfourgood \u003c- ngram(legitname, 4)\r\nfivegood \u003c- ngram(legitname, 5)\r\ngood345 \u003c- ngram(legitname, c(3,4,5))\r\nLet’s just do a quick smell test here and look at some values from the getngram function in the dga package, and how they compare across various n-grams.\r\ngood \u003c- c(\"facebook\", \"google\", \"youtube\",\r\n \"yahoo\", \"baidu\", \"wikipedia\")\r\ngetngram(threegood, good)\r\n## facebook google youtube yahoo baidu wikipedia\r\n## 7.264 7.550 6.674 2.593 0.699 7.568\r\nbad \u003c- c(\"hwenbesxjwrwa\", \"oovftsaempntpx\", \"uipgqhfrojbnjo\",\r\n \"igpjponmegrxjtr\", \"eoitadcdyaeqh\", \"bqadfgvmxmypkr\")\r\ngetngram(threegood, bad)\r\n## hwenbesxjwrwa oovftsaempntpx uipgqhfrojbnjo igpjponmegrxjtr\r\n## 2.6812 4.1216 2.9499 2.7482\r\n## eoitadcdyaeqh bqadfgvmxmypkr\r\n## 3.7638 0.6021\r\nNotice these aren’t perfect, and that’s okay: the algorithms you will try out in Part 3 of the series won’t use just one variable. The strength of the classifier will come from using all the variables together. So let’s go ahead and construct all the features that we want to use here and prepare for part 3, where you will select a model by trying various classifiers. In a real application, there is a relationship between feature generation and model selection. 
Algorithms will act differently on different features, and after trying a few you may want to go back to feature selection and add or remove some features.\r\nFor the sake of simplicity, we will go with these 5 sets of n-grams and the multiple-length set used in the Click Security model.\r\nPrepping the rest of the features\r\nNow that you understand n-grams, you can go ahead and generate the rest of the features and save them off for later. Note that every time you want to classify a new domain, you will need to generate its list of features, so the reference n-grams generated above will have to be saved to generate the features that rely on them.\r\n# dga package has \"entropy\" to calculate entropy\r\nsampledga$entropy=entropy(sampledga$domain)\r\n# get length (number of characters) in domain name\r\nsampledga$length=nchar(sampledga$domain)\r\n# calc distances for each domain\r\nsampledga$onegram \u003c- getngram(onegood, sampledga$domain)\r\nsampledga$twogram \u003c- getngram(twogood, sampledga$domain)\r\nsampledga$threegram \u003c- getngram(threegood, sampledga$domain)\r\nsampledga$fourgram \u003c- getngram(fourgood, sampledga$domain)\r\nsampledga$fivegram \u003c- getngram(fivegood, sampledga$domain)\r\nsampledga$gram345 \u003c- getngram(good345, sampledga$domain)\r\nNote that I am just tossing in every n-gram from 1 to 5 characters, plus the merged 3-, 4- and 5-gram set. 
I doubt that all of these will be helpful, and I fully expect that many of them will be dropped in the final model, which I will cover in part 3.\r\nDictionary matching\r\nThere is one last feature I want to add, and it will try to answer the question of “How much of the string can be explained by a dictionary?” I’m adding it because I’ve already created several models and found myself getting frustrated seeing a domain like “oxfordlawtrove” being classified as a “dga”, when any human can look at that and see three distinct words. Therefore, I created the function wmatch in the dga package to return the percentage of characters that are in the dictionary. I am also using the dictionary that was included in Click Security’s code, and it seems to be a little loose about what counts as a valid word. At some point that dictionary could be rebuilt and cleaned up, but, for the sake of time, we can just go with it as it is.\r\nwmatch(c(\"facebook\", \"oxfordlawtrove\", \"uipgqhfrojbnjo\"))\r\n## [1] 1.0000 1.0000 0.4286\r\n# calculate it for every word in the sample\r\nsampledga$dict \u003c- wmatch(sampledga$domain)\r\n# and let's look at a few randomly (3 legit, 3 dga)\r\nsampledga[c(sample(5000, 3), sample(5000, 3)+5000), c(6:14)]\r\n## entropy length onegram twogram threegram fourgram fivegram gram345\r\n## 162 2.725 9 28.52 14.494 6.552 3.6444 1.69 11.887\r\n## 291 1.922 5 16.02 8.868 2.924 0.6021 0.00 3.526\r\n## 473 2.922 10 34.29 20.179 8.397 1.4771 0.00 9.875\r\n## 6519 3.027 13 43.00 19.517 3.085 0.0000 0.00 3.085\r\n## 39999 3.804 24 71.32 21.617 1.833 0.0000 0.00 1.833\r\n## 34989 3.852 28 83.38 22.578 0.699 0.0000 0.00 0.699\r\n## dict\r\n## 162 0.8889\r\n## 291 1.0000\r\n## 473 1.0000\r\n## 6519 0.4615\r\n## 39999 0.3750\r\n## 34989 0.2143\r\nAnd because what we want from the features is a separation between our classes, we can use the fun GGally package to visualize 
the interaction between some of our variables (this graphic takes a while to generate).\r\nlibrary(GGally)\r\nlibrary(ggplot2)\r\ngg \u003c- ggpairs(sampledga,\r\n columns = c(\"entropy\", \"length\", \"onegram\", \"threegram\", \"dict\", \"class\"),\r\n color=\"class\",\r\n lower=list(continuous=\"smooth\", params=c(alpha=0.5)),\r\n diag=list(continuous=\"bar\", combo=\"bar\", params=c(alpha=0.5)),\r\n upper = list(continuous = \"density\", combo = \"box\", params=c(alpha=0.5)),\r\n axisLabels='show')\r\nprint(gg)\r\nIt’s pretty clear in the picture that the dictionary-matching feature I added last creates quite a large separation between the two datasets. Now let’s save off the sample object for use in part 3 of the blog series.\r\nsave(sampledga, file=\"data/sampledga.rda\", compress=\"xz\")\r\nSource: https://datadrivensecurity.info/blog/posts/2014/Oct/dga-part2/",
	"extraction_quality": 1,
	"language": "EN",
	"sources": [
		"MITRE"
	],
	"references": [
		"https://datadrivensecurity.info/blog/posts/2014/Oct/dga-part2/"
	],
	"report_names": [
		"dga-part2"
	],
	"threat_actors": [],
	"ts_created_at": 1775439148,
	"ts_updated_at": 1775791239,
	"ts_creation_date": 0,
	"ts_modification_date": 0,
	"files": {
		"pdf": "https://archive.orkl.eu/1b047930461f52c89241fbf387be9e367d6ac3c9.pdf",
		"text": "https://archive.orkl.eu/1b047930461f52c89241fbf387be9e367d6ac3c9.txt",
		"img": "https://archive.orkl.eu/1b047930461f52c89241fbf387be9e367d6ac3c9.jpg"
	}
}