{
	"id": "5c26037c-e69b-4756-9aeb-635af8fc30de",
	"created_at": "2026-04-06T00:17:56.185925Z",
	"updated_at": "2026-04-10T03:37:23.886463Z",
	"deleted_at": null,
	"sha1_hash": "fb130532fefc0d6433fe82fd8ba7d9f6c4df6cf9",
	"title": "Threat hunting in large datasets by clustering security events",
	"llm_title": "",
	"authors": "",
	"file_creation_date": "0001-01-01T00:00:00Z",
	"file_modification_date": "0001-01-01T00:00:00Z",
	"file_size": 2293801,
	"plain_text": "Threat hunting in large datasets by clustering security events\r\nBy Tiago Pereira\r\nPublished: 2021-10-04 · Archived: 2026-04-05 19:53:29 UTC\r\nhttps://blog.talosintelligence.com/2021/10/threat-hunting-in-large-datasets-by.html\r\nPage 1 of 16\n\nSecurity tools can produce very large amounts of data that even the most sophisticated organizations may struggle to manage.\r\nBig data processing tools, such as Spark, can be powerful additions to the arsenal of security teams.\r\nThis post walks through threat hunting on large datasets by clustering similar events to reduce the search space and provide additional context.\r\nThere is a limit to the amount of information that humans can process. Even the most sophisticated organizations may struggle with the sheer amount of data generated by modern security systems — hence the need for data-processing tools to reduce this vast amount of data into manageable information that can be processed manually, if required.\r\nAs an example, we'd like to walk through how we hunted for new threats by using a big data processing tool/library — Apache Spark — to group large amounts of suspicious events into manageable groups. This technique may be useful to organizations of all sizes for handling large amounts of security events efficiently, or it could inspire ideas for improving existing tools.\r\nAlthough we use data generated by our own tools, the method described is generic and can be used by organizations of all sizes and on varying datasets from several sources, such as Windows logs, security solution logs (e.g., SIEM, Cisco Secure Endpoint) or proxy logs. 
The only requirements for this method are an available Spark cluster and data stored in a medium that is appropriate for Spark, such as CSV or JSON files in a cloud or a large physical storage system.\r\nIt is also worth mentioning that the method shown here is not the only clustering option. However, it uses algorithms that are suited to processing very large volumes of data, is generic enough to be easily adapted to different datasets, and makes use only of the freely available and convenient Spark libraries.\r\nPreparing data for clustering\r\nThe basic concept of the system is very simple: We represent each of our items as a set of \"tokens\" and then compare how similar the set derived from one item is to the other sets. This allows us to find items that are most similar to each other, even if they are subtly different.\r\nThese \"tokens\" are very similar to words in a book. If you represent a book as a set of words, you can identify books that can be grouped together based on the words they share. With this technique, you could group together English-language books or ones written in Spanish or German. By applying tighter criteria for clustering, we can identify groups of books that mention \"malware,\" \"computer,\" or \"vulnerability\" separately from another group that may mention \"Dumbledore,\" \"Hagrid\" or \"Voldemort.\"\r\nHowever, before any clustering takes place, we need to load and prepare the data for processing.\r\nThe first step is to load pyspark and import a few necessary libraries:\r\nCountVectorizer is part of the machine learning package pyspark.ml and transforms the data into a format that is used by many ML algorithms. The MinHashLSH package will be used to reduce the amount of data that needs to be processed and to calculate a set of similar event pairs. 
The graphframes library is a pyspark graph library based on Spark dataframes.\r\nOnce the environment is ready, we start by loading the data and immediately transforming it. The first commands follow:\r\nOn the first line, the data is read from its path using Spark's read method. On the second line, a set of functions is called that concatenates the short_description and argv fields, splits them on each non-alphanumeric character and finally \"explodes\" them, creating one row per word. Finally, on the third line, the words are grouped by the system where they were seen. As a result, for each agent, we will have an array of words that were used in its command line.\r\nThe following image shows the resulting table, with the systems and the commands broken down into words:\r\nEncoding the data in the correct format\r\nMost algorithms from machine learning libraries need the input data in a specific format. In this case, the MinHash algorithm requires a numeric vector of features, so we'll use the CountVectorizer function, which will transform all the unique words present in each of the events into columns. Events are described by the count of occurrences of each word in the column corresponding to that word. 
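The read-concatenate-split-explode-group sequence described above can be mirrored in plain Python, without a Spark cluster, to make the result concrete. The sample events below are hypothetical stand-ins for the short_description and argv fields:

```python
import re
from collections import defaultdict

# Hypothetical (agent, command line) pairs standing in for the real events.
events = [
    ('host-1', 'wscript.exe //B script.js --IsErik'),
    ('host-1', 'cmd.exe /c whoami'),
    ('host-2', 'powershell.exe -enc aGVsbG8='),
]

# Split each command on every non-alphanumeric character (the 'explode'
# step), then group the resulting words by the system they were seen on.
words_by_agent = defaultdict(set)
for agent, cmd in events:
    for word in re.split(r'[^0-9A-Za-z]+', cmd):
        if word:
            words_by_agent[agent].add(word)

# words_by_agent['host-1'] now holds the word set for that system.
```

In Spark, the equivalent result would come from functions such as concat_ws, split and explode followed by a grouped collect; this pure-Python version only illustrates the shape of the per-agent output.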
For example, imagine that each event contains only one word, as in the following table:\r\nAfter using the \"CountVectorizer\" encoder, the event table would look like this:\r\nStarting from the tocluster dataframe shown above, the following code shows the operation of transforming this data into a numeric feature vector by adding a column called \"features\" to the dataframe.\r\nThis will result in a massive vector, with a huge amount of information that will require a lot of processing power. Luckily, this can be solved, or at least improved, using Locality Sensitive Hashing, as we will see in the next section.\r\nThe following image shows the resulting data frame, with a features column that contains a numeric vector of the counts of each word:\r\nMinHash LSH\r\nLocality Sensitive Hashing (LSH) is a fairly complex topic. However, Spark has some nice machine learning libraries that allow users to harness the power of the technique without having to know all the details.\r\nUnlike other hash algorithms, LSH seeks to maximize, rather than minimize, hash collisions. Therefore, the user can compute a set of LSH hashes for each event, such that the number of words or features two events have in common is reflected in the number of hashes they share. This means that we can then discard the enormous number of feature columns and use only a few hashes to calculate the similarity.\r\nThe use of MinHash LSH makes it possible to calculate the similarity of very large datasets, using Spark's power for distributed processing across a large number of systems. 
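To see what the CountVectorizer encoding produces, here is a small pure-Python sketch of the same idea; the per-system word lists are hypothetical, and pyspark.ml's CountVectorizer does the equivalent at scale, storing the counts in a sparse vector:

```python
from collections import Counter

# Hypothetical per-system word lists, as produced by the grouping step.
docs = {
    'host-1': ['cmd', 'exe', 'whoami', 'cmd'],
    'host-2': ['powershell', 'exe', 'enc'],
}

# Every unique word becomes a column index (the vocabulary).
vocab = sorted({w for words in docs.values() for w in words})
index = {w: i for i, w in enumerate(vocab)}

# Each system becomes a numeric vector holding the count of each word
# in the column corresponding to that word: the 'features' column.
features = {}
for host, words in docs.items():
    vec = [0] * len(vocab)
    for w, n in Counter(words).items():
        vec[index[w]] = n
    features[host] = vec
```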
Trying to use a simple machine learning library on a single system to cluster this amount of information would be almost impossible.\r\nThe code section using MinHash LSH is shown below:\r\nIn this case, we used 10 hashes. There is a tradeoff between processing time and accuracy in the number of computed hashes per record: more hashes return more accurate results but require more processing time, while fewer hashes return less accurate results while requiring less processing time. Although there are complex ways to select the optimal value, for manual threat hunting, a bit of trial and error is usually good enough to arrive at a number of hashes that works.\r\nThe following image shows the data frame with the resulting hashes. Now, instead of a huge vector, we have only an array of length 10 on each row that needs to be analysed.\r\nComputing similarity\r\nThere are many techniques to calculate the similarity between events. In this case, we calculate the Jaccard distance between events to determine what is similar and what is not. The Jaccard similarity is the number of words that A and B have in common divided by the total number of distinct words in A and B combined; the Jaccard distance is one minus that value.\r\nThe code to compute the similarity between all the events is:\r\nWe have to decide how similar two events must be before we cluster them together. In this case, we use the value of 0.2 as the maximum Jaccard distance between events that we require to consider them similar. Zero would mean exactly equal events and one would mean completely different events.\r\nAfter selecting this value and preparing the data, we must consider the quality of the resulting clusters. If they're too small, we will have too many clusters to work with. 
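The Jaccard computation, and the way MinHash approximates it, can be illustrated in a few lines of plain Python. Spark's MinHashLSH uses its own hash family, so this sketch (built on Python's hash with arbitrary seeds) only demonstrates the principle:

```python
def jaccard_distance(a, b):
    # Jaccard distance = 1 - (size of intersection / size of union);
    # 0 means identical sets, 1 means completely disjoint sets.
    return 1.0 - len(a & b) / len(a | b)

def minhash_signature(words, seeds):
    # One hash function per seed; keep only the minimum hash over the set.
    # Similar sets agree on many of these minimum values.
    return [min(hash((seed, w)) for w in words) for seed in seeds]

a = {'cmd', 'exe', 'whoami'}
b = {'cmd', 'exe', 'hostname'}

# Exact distance: 2 words in common out of 4 distinct words -> 0.5.
d = jaccard_distance(a, b)

# MinHash estimate: the fraction of matching signature positions
# approximates the Jaccard similarity (1 - distance).
seeds = range(200)
sig_a = minhash_signature(a, seeds)
sig_b = minhash_signature(b, seeds)
est_similarity = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

With only 10 hash tables, as in the post, the estimate is noisier; the tradeoff between hash count, accuracy and processing time discussed above falls out of exactly this approximation.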
If they're too large, we'll have a few low-quality clusters containing relatively different events. The choice of value depends on the nature of the dataset and the objectives of clustering. Again, some trial and error is required to create the most useful clusters for each case.\r\nThe following image shows the resulting table of similar pairs:\r\nGrouping similar events\r\nAfter calculating the similarity between events, which essentially cross-joins the table, we have a huge table with pairs of similar events. We can query for events similar to any particular event very quickly. However, what we are really after is a limited number of groups of similar commands.\r\nThere are many ways to do this (as there are with all parts of the presented method), but here we'll identify communities of connected points in a graph.\r\nWe used a very powerful Spark library called \"Graphframes\". This library works with the relationships between nodes (or vertices) and their connections (or edges) and executes known graph algorithms to extract information from these relations.\r\nIn this case, we used its connectedComponents algorithm to group sets of similar nodes. The image below shows a theoretical example of how this would look.\r\nIn the example above, there are two communities and a singleton. The blue community is very straightforward, as all the nodes are similar to each other. The H node is only similar to itself, so it is a community of one. 
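GraphFrames runs connectedComponents at scale over dataframes of vertices and edges. The underlying idea can be shown with a small union-find in plain Python, using hypothetical node IDs that mirror the example image:

```python
def connected_components(nodes, edges):
    # Union-find: every similar pair merges two communities.
    parent = {n: n for n in nodes}

    def find(n):
        # Walk to the root, compressing the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=len, reverse=True)

# Hypothetical similar pairs: A-B and A-C are similar (so B and C land in
# the same community without being directly similar), D-E are similar,
# and H is similar only to itself.
nodes = ['A', 'B', 'C', 'D', 'E', 'H']
edges = [('A', 'B'), ('A', 'C'), ('D', 'E')]
communities = connected_components(nodes, edges)
```

In the post's pipeline the same grouping comes from building a GraphFrame from the node and similar-pair dataframes and calling its connectedComponents method, which returns a community label per node.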
And finally, the yellow community shows that although there is no specific similarity between B and C, they are part of the same community, since there are similarities between other members of the same community.\r\nThe following code computes the communities from the similar event pairs calculated in the previous step.\r\nThe \"v\" variable contains all the node IDs, and the variable \"e\" contains all the similar pairs calculated in the previous step. After creating a GraphFrame object with these values, calculating the communities is as simple as invoking the connectedComponents method of the Graph object.\r\nThe following graph shows the communities that the described methodology generated. Each dot represents one system and, as expected, the communities are not connected to each other. There are various communities with different sizes and colors. In any case, the most important aspect is that the search space for a human researcher was reduced dramatically.\r\nDigging into the results\r\nNow that we have a set of clusters to research, we'll perform a deeper analysis of three representative clusters that highlight different attributes of the described method.\r\nSummarising the examples that follow:\r\nThe first example shows how a community of similar but not identical commands was found that contained one common strange word.\r\nThe second example shows how a choice made previously in the way the data was prepared for clustering produced a useful result with \"better\" clusters, and how the system was able to isolate a few attack patterns.\r\nFinally, by limiting the clusters to only those with recent occurrences, we detected a relevant change in the behaviour of a known financially motivated threat actor.\r\nExample 1: Who IsErik?\r\nThis cluster stood out to 
us:\r\nThe first thing that stands out is a string common to all the events — \"--IsErik\" — which begs the question: Who is Erik, and what is he doing on these systems?\r\nIdentifying the threat that these events relate to is pretty straightforward. A quick Google search for the \"isErik\" string reveals numerous articles describing it as an artifact of a known persistent adware family.\r\nWhat is interesting about this cluster, and the reason it was selected as an example, is that the name and path of the file that wscript.exe executes, as well as the hex values that follow, are different for each event. This shows that the clustering system is doing its job of accepting small differences.\r\nExample 2: Hiding a miner on Exchange Servers\r\nWe'll also look at a set of clustered events that demonstrates the importance of selecting and preparing data before clustering.\r\nIt may seem strange that a set of completely different commands was joined into the clustered events below.\r\nHowever, this was intended: the unit of clustering was the host (all the commands run on a host), not the individual command-line event, so the words from all of a host's commands are grouped together. As a result, the clusters contain not only groups of similar commands but also groups of similar command combinations.\r\nThis has one big advantage: context. Different attacks are grouped separately, even when some of the events are similar between groups, as when multiple unrelated attacks exploit a common vulnerability.\r\nSo, what attack does this cluster reveal? Performing further analysis on one of the affected systems, we observed the following sequence of events:\r\nAttack vector\r\nW3wp.exe's execution of a cmd.exe was our first sign of suspicious activity. 
W3wp is the IIS worker process, also used on Exchange servers. Knowing that this system hosts an internet-exposed Exchange server, and given the numerous critical vulnerabilities recently published and widely exploited, we can assume that this is the attack vector: exploiting one of the Exchange vulnerabilities.\r\nInstallation stealth and persistence\r\nThe first command executes a base64-encoded PowerShell payload that can be decoded into:\r\nThe command contains minimal obfuscation and appears to be attempting to download and execute something from the URL https://122[.]10[.]82[.]109:8080/connect, taking special care to set a specific user agent, possibly to evade automatic analysis of the URL.\r\nThe additional PowerShell code is responsible for the remaining installation and execution.\r\nIt downloads the final payload and writes it to the file system as C:\\ProgramData\\Microsoft\\conhost.exe. In the following steps, the script deletes PowerShell logs and registers the final payload as a Windows service using a long command line with some strange permission settings:\r\nA quick Google search reveals that these settings are used to make the service hidden and, without additional actions, unremovable using the regular Windows administration tools.\r\nFinal payload\r\nFinally, the C:\\ProgramData\\Microsoft\\conhost.exe file (the same one that is used for the hidden service) is executed by the process powershell.exe.\r\nThis file (sha256: 81A6DE094B78F7D2C21EB91CD0B04F2BED53C980D8999BF889B9A268E9EE364C) is XMRig, a known cryptocurrency miner. 
We can confirm this by looking at its communications and pool login ID.\r\nWhile this attack is not particularly sophisticated, it uses some interesting tricks to hide and persist its execution on the system.\r\nIt's interesting that these events were grouped even though they were not very similar, while other attacks against Exchange servers that have similar initial access commands were not. By looking at this technique, a researcher could identify the different types of ongoing attacks against Exchange Server.\r\nThreat actor TA551's Bazar\r\nThis final example highlights how clustering can be done in a time-bounded way, to reveal only attacks happening within a restricted time frame. In this case, clustering was limited to what happened in the last few days. As a result, the recent clusters drew our attention to a malware campaign by a known threat actor that, once further researched, showed some changes in the actor's usual activities.\r\nWe started by looking at the following listing of the cluster events:\r\nThe cluster contained a suspicious combination of commands, using DLL files with a JPEG extension and registering them as services (IcedID is known to have been distributed this way). However, after analysis, we concluded that BazarBackdoor was the actual malware being distributed in this campaign and, based on the TTPs, that TA551 was the probable adversary in this case.\r\nAdditional analysis on one of the affected systems uncovered the following sequence of relevant events:\r\nAttack vector\r\nAs can be seen in Steps 1 and 2, the attack started with an email with an attachment. 
The ZIP file, named \"request.zip,\" contained a .doc file named \"official paper,08.21.doc\" and was encrypted to avoid detection by email protection systems.\r\nMalware installation\r\nWhen the .doc file was opened, a macro wrote an .hta file to disk and used mshta.exe to execute its contents.\r\nThe .hta file is the downloader: it connects to the server at 185[.]53[.]46[.]33, downloads a file, writes it to disk as a .jpg file and registers it as a service using regsvr32.exe. The .jpg file is actually a .dll file that contains the backdoor to be installed on the system.\r\nBazarBackdoor\r\nOnce executed, the DLL (devDivEx.jpg) connects to the host 167[.]172[.]37[.]20.\r\nWith OSINT, we found several samples with similar names (devDivEx.jpg) that connect to the same host. We identified these as BazarBackdoor through memory analysis rules and their use of a DGA with the .bazar TLD. The following are examples of these samples:\r\nC96ee44c63d568c3af611c4ee84916d2016096a6079e57f1da84d2fdd7e6a8a3\r\nf7041ccec71a89061286d88cb6bde58c851d4ce73fe6529b6893589425cd85da\r\nThe Trickbot installation\r\nAround one hour after the Bazar infection occurred, the svchost.exe process started performing additional suspicious activities:\r\nAs shown, svchost.exe (which was running the BazarBackdoor process) writes a DLL file to disk and starts it using rundll32.exe. 
A few seconds later, the rundll32.exe process starts connecting to the IP addresses 103[.]140[.]207[.]110, 103[.]56[.]207[.]230 and 45[.]239[.]232[.]200.\r\nThese IPs are easily identifiable through OSINT as Trickbot C2 IP addresses, and multiple Trickbot samples can be found connecting to them in public sandbox execution reports.\r\nThe threat actor\r\nWe found that several of the attacker's TTPs are similar to those of the threat actor TA551, leading us to believe with moderate confidence that TA551 is behind this attack. For example:\r\nThe request.zip file name.\r\nUse of email with an encrypted ZIP attachment.\r\nUse of Microsoft Word macros.\r\nUse of an HTA file as downloader.\r\nUse of a DLL with a .jpg extension in the c:\\users\\public directory.\r\nRegistering the JPEG file as a service.\r\nThe format of the commands used to perform each of these activities.\r\nWhile searching for a match for the observed TTPs and IOCs, we found that this activity has also been observed by other researchers, who recently tweeted and blogged about TA551 starting to drop BazarBackdoor and Trickbot.\r\nTA551 is a known, financially motivated attacker that has distributed several other malware families in the past (e.g., Ursnif, Valak, IcedID). Distributing BazarBackdoor is a fairly recent change that deserves network defenders' attention. This example demonstrates how, by selecting only recent clusters, it is possible to identify threats that are active right now.\r\nConclusion\r\nAs attacks become more frequent and impactful, one of the most powerful weapons that organizations have is data. Security is not something that you can master by purchasing a single software or hardware solution. Several layers of defense are needed, and still, attacks will get through occasionally. 
Without data, security teams are blind to ongoing and past attacks that may have passed the existing layers of protection.\r\nSecurity tools can produce very large amounts of data. Thankfully, there are several great tools that help with querying large volumes of data with more or less flexibility. At Talos, Spark is one of the tools we use, for its flexibility and ability to handle very large data sets. These data processing tools can be a powerful part of the arsenal of security teams, and this blog post walks through one technique we use that we find particularly useful. Hopefully, reading it helps \"spark\" a new idea for data processing or \"spark\" the interest to use and explore these tools.\r\nIOCs in this post\r\nSamples:\r\nXMRig Miner:\r\n81A6DE094B78F7D2C21EB91CD0B04F2BED53C980D8999BF889B9A268E9EE364C\r\nBazarBackdoor:\r\nC96ee44c63d568c3af611c4ee84916d2016096a6079e57f1da84d2fdd7e6a8a3\r\nf7041ccec71a89061286d88cb6bde58c851d4ce73fe6529b6893589425cd85da\r\nNetwork IOCs:\r\nIP and URL for miner downloader:\r\n122[.]10[.]82[.]109\r\nhttps://122[.]10[.]82[.]109:8080/connect\r\nBazarBackdoor downloaded from:\r\n185[.]53[.]46[.]33\r\nBazarBackdoor C2:\r\n167[.]172[.]37[.]20\r\nTrickbot C2:\r\n103[.]140[.]207[.]110, 103[.]56[.]207[.]230, 45[.]239[.]232[.]200\r\nSource: https://blog.talosintelligence.com/2021/10/threat-hunting-in-large-datasets-by.html",
	"extraction_quality": 1,
	"language": "EN",
	"sources": [
		"Malpedia"
	],
	"references": [
		"https://blog.talosintelligence.com/2021/10/threat-hunting-in-large-datasets-by.html"
	],
	"report_names": [
		"threat-hunting-in-large-datasets-by.html"
	],
	"threat_actors": [
		{
			"id": "26a04131-2b8c-4e5d-8f38-5c58b86f5e7f",
			"created_at": "2022-10-25T15:50:23.579601Z",
			"updated_at": "2026-04-10T02:00:05.360509Z",
			"deleted_at": null,
			"main_name": "TA551",
			"aliases": [
				"TA551",
				"GOLD CABIN",
				"Shathak"
			],
			"source_name": "MITRE:TA551",
			"tools": [
				"QakBot",
				"IcedID",
				"Valak",
				"Ursnif"
			],
			"source_id": "MITRE",
			"reports": null
		},
		{
			"id": "40b623c7-b621-48db-b55b-dd4f6746fbc6",
			"created_at": "2024-06-19T02:03:08.017681Z",
			"updated_at": "2026-04-10T02:00:03.665818Z",
			"deleted_at": null,
			"main_name": "GOLD CABIN",
			"aliases": [
				"Shathak",
				"TA551 "
			],
			"source_name": "Secureworks:GOLD CABIN",
			"tools": [],
			"source_id": "Secureworks",
			"reports": null
		},
		{
			"id": "90f216f2-4897-46fc-bb76-3acae9d112ca",
			"created_at": "2023-01-06T13:46:39.248936Z",
			"updated_at": "2026-04-10T02:00:03.260122Z",
			"deleted_at": null,
			"main_name": "GOLD CABIN",
			"aliases": [
				"Shakthak",
				"TA551",
				"ATK236",
				"G0127",
				"Monster Libra"
			],
			"source_name": "MISPGALAXY:GOLD CABIN",
			"tools": [],
			"source_id": "MISPGALAXY",
			"reports": null
		},
		{
			"id": "04e34cab-3ee4-4f06-a6f6-5cdd7eccfd68",
			"created_at": "2022-10-25T16:07:24.578896Z",
			"updated_at": "2026-04-10T02:00:05.039955Z",
			"deleted_at": null,
			"main_name": "TA551",
			"aliases": [
				"G0127",
				"Gold Cabin",
				"Monster Libra",
				"Shathak",
				"TA551"
			],
			"source_name": "ETDA:TA551",
			"tools": [
				"BokBot",
				"CRM",
				"Gozi",
				"Gozi CRM",
				"IceID",
				"IcedID",
				"Papras",
				"Snifula",
				"Ursnif",
				"Valak",
				"Valek"
			],
			"source_id": "ETDA",
			"reports": null
		}
	],
	"ts_created_at": 1775434676,
	"ts_updated_at": 1775792243,
	"ts_creation_date": 0,
	"ts_modification_date": 0,
	"files": {
		"pdf": "https://archive.orkl.eu/fb130532fefc0d6433fe82fd8ba7d9f6c4df6cf9.pdf",
		"text": "https://archive.orkl.eu/fb130532fefc0d6433fe82fd8ba7d9f6c4df6cf9.txt",
		"img": "https://archive.orkl.eu/fb130532fefc0d6433fe82fd8ba7d9f6c4df6cf9.jpg"
	}
}