{
	"id": "bc3ca56e-c6bd-4f20-8250-e55998c7d800",
	"created_at": "2026-04-06T00:15:35.703634Z",
	"updated_at": "2026-04-10T13:11:58.569321Z",
	"deleted_at": null,
	"sha1_hash": "23dc40dd9a52f43ec2a588ff28ac2e9b07ab2c87",
	"title": "Categorizing Software with Code Families",
	"llm_title": "",
	"authors": "",
	"file_creation_date": "0001-01-01T00:00:00Z",
	"file_modification_date": "0001-01-01T00:00:00Z",
	"file_size": 188515,
	"plain_text": "Categorizing Software with Code Families\r\nArchived: 2026-04-05 12:57:19 UTC\r\nby savage | 2025-01-22\r\nWhen working on a methodology for tracking software, The Vertex Project analysts wanted an approach that\r\nwould give us greater precision in documenting our findings and asking questions about our data. In More Than\r\nMalware Families, we introduced several categories into which we organize software: code families, software\r\nsuites, and software ecosystems. This blog will focus on the most fundamental of the three, code families, and\r\ndescribe how other analyst teams might approach creating code families to categorize tools.\r\nWhat is a Code Family?\r\nIf we want to be extremely precise when identifying, categorizing, and tracking certain kinds of software, then we\r\nmight begin with creating code families. A code family is a set of executable code based on what an analyst has\r\ndetermined to be the same or highly similar source code. The files associated with that family may share an entire\r\ncode base or a subset of key components (functions) that are unique to or strongly representative of the code\r\nfamily.\r\nCode families are intended to be granular and allow for more precise file identification. A tool consisting of\r\nmultiple files will often map to several code families, with each family corresponding to one of the files making\r\nup the overall tool. For example, what the industry commonly refers to as \"PlugX\" typically consists of three files:\r\nan executable file (often from a legitimate vendor), a side-loading DLL, and shellcode. To Vertex, only the\r\nshellcode that implements the backdoor functionality is part of the PlugX code family. Although the shellcode and\r\nside-loading DLL work together, they do not share the same or highly similar source code required if they were to\r\nbe the same code family. 
We could optionally create another named code family if we wanted to track the different\r\nside-loading DLLs; otherwise, we might simply track the executable and side-loading DLL as part of the PlugX\r\necosystem.\r\nCode families are not inherently malicious - analysts can create code families to identify software in general and\r\ntrack a broader variety of tools. An analyst may create a code family to track samples of Microsoft’s PsExec, for\r\nexample. Tracking tools, as we noted in our previous blog, can help analysts more easily recognize samples as\r\nthey come across them, as well as provide context and identify tactics, techniques, and procedures associated with\r\nactivity of interest.\r\nWhy Create Code Families?\r\nAnalyst teams can create code families to help with categorizing tools and components, developing detection, and\r\nidentifying changes in tools over time. With code families, research into a tool begins at the code level as analysts\r\ndetermine which key samples of source code within that tool will serve as the anchor for the resulting code family.\r\nAfter selecting those code samples, which we refer to as anchor functions, the analyst can identify the\r\nhttps://vertex.link/blogs/categorizing-software-with-code-families/\r\nPage 1 of 5\n\ncorresponding files containing those functions and mark them as associated with that code family. From there, an\r\nanalyst might work to document relationships between different tools, creating Software Suites or Software\r\nEcosystems as appropriate.\r\nWhile creating code families has its benefits, this approach is not for everyone and is not a necessary starting point\r\nfor tool identification. 
For some teams, basing tool identification on code similarities may be too granular an\r\napproach and inconsistent with their analysis needs.\r\nCreating a Code Family\r\nThere are two main approaches to choose from when it comes to creating code families, one of which allows for\r\nhigher fidelity but requires greater resources. The choice in methodology will largely depend upon a combination\r\nof a team’s analysis requirements and available resources. Thus, teams should choose the methodology that best\r\naligns with their tasking, analytic outputs, and available resources.\r\nApproach 1: Basing the Code Family Off of Anchor Functions\r\nOf the two approaches, the higher-fidelity method for creating a code family involves basing it off of one or\r\nmore anchor functions representing key aspects of the source code. An anchor function is the seed of the code\r\nfamily cluster, similar to how a threat cluster seed is the starting point for a threat cluster. As such, while a code\r\nfamily can have multiple anchor functions, each should be unique to that code family.\r\nIdeally, an anchor function will be representative of or tied to a key capability of the executable source code.\r\nHowever, in many instances it is not the capability itself that is unique but the way in which it is implemented. For\r\nexample, some backdoors obfuscate or encrypt the names of API calls made to the host operating system (e.g.,\r\nCreateFileA) to mask their functionality. These strings are decoded or decrypted at runtime. The specific\r\nalgorithms (functions) used and their implementation may be unique to the backdoor, and could therefore be a\r\ngood candidate for an anchor function for the backdoor's code family.\r\nA team’s approach to identifying anchor functions will depend upon its analysis requirements and resourcing. 
The\r\nmost precise method is also the most resource intensive, as it involves relying on a malware reverse engineer to\r\nidentify anchor functions through symbolic execution. Teams without dedicated reverse engineering support may\r\nuse tools like Vivisect, which identifies symbolic functions that analysts can use to select anchor functions.\r\nSymbolic execution is a robust method of identification as it targets the logic behind the instructions found in the\r\ncode, and will therefore persist across changes in bytes. In contrast, comparing instruction byte code with\r\nsomething like YARA is more fragile and poses a greater risk of false negatives. Analysts working with a reverse\r\nengineer can generate less fragile signatures for anchor functions by omitting relocations that would make the\r\ncode position dependent. This would help ensure that the signatures still match against files containing the same\r\ninstruction sequences, even if they are loaded at a different address in RAM.\r\nApproach 2: Generally Identifying the Existence of a Code Family\r\nAnother approach would be to infer the existence of a code family among a set of highly similar samples, rather\r\nthan identifying specific anchor functions to precisely define the code family. Instead of relying on symbolic code\r\nanalysis, this tactic involves using static and dynamic analysis to identify similarities implying the existence of a\r\nshared code family among files. These similarities may include a combination of strings found in a binary, format\r\nstrings in a URL, and execution behaviors, among others.\r\nAlthough this approach is more accessible and less resource-intensive than identifying anchor functions, it also\r\nposes a higher risk of false positives. 
YARA rules that rely on strings and other application data are less accurate\r\nthan those focused on identifying functions, as the latter target the code itself, rather than the data the code uses.\r\nWhile an analyst can select a range of similarities as evidence of a code family, some will be higher fidelity than\r\nothers. An analyst must therefore be cognizant of what shared traits they are noting as a proxy for the code\r\nfamily, as selecting something insufficiently unique can result in false positives.\r\nCode Families in Practice: The Carrotstick Backdoor\r\nSo what does creating a code family look like in practice, and how do we then represent the results in Synapse?\r\nLet’s take a look at a code family I created for a backdoor Cybersec Sentinel originally reported on in early June\r\n2024. According to CyberSec Sentinel, Elastic, and others, phishing emails use employment-related lures to entice\r\nrecipients to click on a malicious link, which, after a series of redirects, delivers a JavaScript file that downloads a\r\nbackdoor. While Cybersec Sentinel, Elastic, and others refer to the backdoor as WarmCookie, I opted to name our\r\ninternal code family Carrotstick to differentiate between our own analysis and that of other organizations.\r\nIn this instance, I sought to generally identify the existence of a code family among the backdoor samples, rather\r\nthan try to specifically identify anchor functions upon which to base the code family. 
My evidence for the code\r\nfamily included a mix of execution behavior, such as:\r\nDownloading a DLL to a temp directory with a random name and file extension, while also copying the\r\nDLL to C:/ProgramData/RtlUpd/RtlUpd.dll ;\r\nUsing rundll32.exe to launch the DLL with the parameters \"Start, /p\" for persistence;\r\nHaving a hard-coded GUID-like string as a mutex; and\r\nCommunicating with a hardcoded IP address over HTTP.\r\nI also used Elastic's YARA rule, although I edited it to include a different combination of strings and conditions.\r\nAfter identifying the parameters for the Carrotstick code family, I created a risk:tool:software node and linked\r\nit to an it:prod:soft node to represent it in Synapse:\r\nThe it:prod:soft node documents Carrotstick as a type of software, while the risk:tool:software node\r\n(linked to it:prod:soft through the risk:tool:software:soft property), shows that Carrotstick is a tool\r\nassociated with malicious activity. I can use additional forms within Synapse to track further details, like\r\nversioning information ( it:prod:softver ) and associated techniques ( ou:technique ) as well.\r\nAfter creating risk:tool:software and it:prod:soft nodes to represent Carrotstick at a high level, I tagged\r\nthe file:bytes nodes representing Carrotstick samples and their associated hash:sha256 , hash:sha1 , and\r\nhash:md5 nodes with #cno.code.carrotstick to keep track of them. At this point, I have both higher level\r\ndetails about Carrotstick reflected in Synapse, as well as actual samples of the code family.\r\nYou can review this data in the Vertex Intel-Sharing Instance (register here for access). In the TLP-Green view,\r\nquery risk:tool:software:soft:name=carrotstick to lift the risk:tool:software node representing the\r\nCarrotstick backdoor. 
You can then use the Explore button to pivot to the associated syn:tag nodes (and then\r\nagain from there, to view all nodes with the #cno.code.carrotstick tag), as shown below:\r\nAlternatively, you can query #cno.code.carrotstick to lift the tagged nodes directly.\r\nCategorizing with Code Families\r\nWithin The Vertex Project, we create code families to allow for greater granularity and precision in identifying\r\nsoftware by doing so at the code level. Using code families to categorize software allows us to deconstruct a tool\r\ndown to the anchor function(s), or key components representing the code family. As noted in this blog, there are\r\nmultiple approaches that teams may take when it comes to identifying anchor functions and creating code\r\nfamilies, from those that offer higher fidelity but are more resource intensive, to those that require fewer resources\r\nbut carry a greater risk of false positives. As always, we encourage teams to choose the approach that is most\r\nappropriate for their use case.\r\nIn a following blog, we'll discuss Software Suites and Software Ecosystems, as well as walk through creating a\r\nSoftware Ecosystem to track indicators associated with our Carrotstick backdoor.\r\nSource: https://vertex.link/blogs/categorizing-software-with-code-families/",
	"extraction_quality": 1,
	"language": "EN",
	"sources": [
		"Malpedia"
	],
	"origins": [
		"web"
	],
	"references": [
		"https://vertex.link/blogs/categorizing-software-with-code-families/"
	],
	"report_names": [
		"categorizing-software-with-code-families"
	],
	"threat_actors": [],
	"ts_created_at": 1775434535,
	"ts_updated_at": 1775826718,
	"ts_creation_date": 0,
	"ts_modification_date": 0,
	"files": {
		"pdf": "https://archive.orkl.eu/23dc40dd9a52f43ec2a588ff28ac2e9b07ab2c87.pdf",
		"text": "https://archive.orkl.eu/23dc40dd9a52f43ec2a588ff28ac2e9b07ab2c87.txt",
		"img": "https://archive.orkl.eu/23dc40dd9a52f43ec2a588ff28ac2e9b07ab2c87.jpg"
	}
}