{
	"id": "c8da154c-2ac6-41ea-ad46-0622fe2008c3",
	"created_at": "2026-04-06T00:22:07.674434Z",
	"updated_at": "2026-04-10T03:21:15.314947Z",
	"deleted_at": null,
	"sha1_hash": "57bf4ecc98fa3c7899d38adea2c680dd3a582d03",
	"title": "A tale of Phobos - how we almost cracked a ransomware using CUDA",
	"llm_title": "",
	"authors": "",
	"file_creation_date": "0001-01-01T00:00:00Z",
	"file_modification_date": "0001-01-01T00:00:00Z",
	"file_size": 2053509,
	"plain_text": "A tale of Phobos - how we almost cracked a ransomware using\r\nCUDA\r\nArchived: 2026-04-05 18:46:17 UTC\r\nAbstract: For the past two years we've been tinkering with a proof-of-concept decryptor for the Phobos family\r\nransomware. It works, but is impractical to use for reasons we'll explain here. Consequently, we've been unable to\r\nuse it to help a real-world victim so far. We've decided to publish our findings and tools, in hope that someone will\r\nfind it useful, interesting or will continue our research. We will describe the vulnerability, and how we improved\r\nour decryptor computational complexity and performance to reach an almost practical implementation. The\r\nresulting proof of concept is available at CERT-Polska/phobos-cuda-decryptor-poc.\r\nWhen, what and why\r\nPhobos is the innermost and larger of the two natural satellites of Mars, with the other being Deimos. However,\r\nit's also a widespread ransomware family that was first observed in the early 2019. Not a very interesting one – it\r\nshares many similarities with Dharma and was probably written by the same authors. There is nothing obviously\r\ninteresting about Phobos (the ransomware) – no significant innovations or interesting features.\r\nWe've started our research after a few significant Polish organizations were encrypted by Phobos in a short period\r\nof time. After that it has become clear, that the Phobos' key schedule function is unusual, and one could even say –\r\nbroken. This prompted us to do further research, in hope of creating a decryptor. But let's not get ahead of\r\nourselves.\r\nReverse Engineering\r\nLet's skip the boring details that are the same for every ransomware (anti-debug features, deletion of shadow\r\ncopies, main disk traversal function, etc). 
The most interesting function for us right now is the key schedule:\r\nhttps://cert.pl/en/posts/2023/02/breaking-phobos/\r\nPage 1 of 11\n\nThe curious part is that, instead of using one good entropy source, the malware author decided to use multiple bad\r\nones. They include:\r\nTwo calls to QueryPerformanceCounter()\r\nGetLocalTime() + SystemTimeToFileTime()\r\nGetTickCount()\r\nGetCurrentProcessId()\r\nGetCurrentThreadId()\r\nFinally, a variable but deterministic number of SHA-256 rounds is applied.\r\nOn average, to check a key we need 256 SHA-256 executions and a single AES decryption.\r\nThis immediately sounded multiple alarms in our heads. Assuming we know the time of the infection with 1\r\nsecond precision (for example, using file timestamps or logs), the number of operations needed to brute-force\r\neach component is:\r\nOperation No. operations\r\nQueryPerformanceCounter 1 10**7\r\nQueryPerformanceCounter 2 10**7\r\nGetTickCount 10**3\r\nSystemTimeToFileTime 10**5\r\nGetCurrentProcessId 2**30\r\nGetCurrentThreadId 2**30\r\nIt's obvious that every component can be brute-forced independently (10**7 is not that much for modern\r\ncomputers). But can we recover the whole key?\r\nAssumptions, assumptions (how many keys do we actually need to check?)\r\nInitial status (141 bits of entropy)\r\nBy simple multiplication, the number of keys for a naïve brute-force attack is:\r\n\u003e\u003e\u003e 10**7 * 10**7 * 10**3 * 10**5 * 2**30 * 2**30 * 256\r\n2951479051793528258560000000000000000000000\r\n\u003e\u003e\u003e log(10**7 * 10**7 * 10**3 * 10**5 * 2**30 * 2**30 * 256, 2)\r\n141.08241808752197 # that's a 141-bit number\r\nThis is... obviously a huge, incomprehensibly large number. Not the slightest chance to brute-force that.\r\nBut we're not ones to give up easily. Let's keep thinking about that.\r\nMay I have your PID, please? 
(81 bits of entropy)\r\nLet's start with some assumptions.\r\nWe already made one: thanks to system/file logs we know the time with 1 second precision. One such source can\r\nbe the Windows event logs:\r\nBy default there is no event that triggers for every new process, but with enough forensic analysis it's often\r\npossible to recover the PID and TID of the ransomware process.\r\nEven if it's not possible, it's almost always possible to restrict them by a significant amount, because Windows\r\nPIDs are sequential. So we won't usually have to brute-force the full 2**30 key space.\r\nFor the sake of this blog post, let's assume that we already know the PID and TID (of the main thread) of the ransomware\r\nprocess. Don't worry, this is the biggest hand-wave in this whole article. Does it make our situation better at least?\r\n\u003e\u003e\u003e log(10**7 * 10**7 * 10**3 * 10**5 * 256, 2)\r\n81.08241808752197\r\n81 bits of entropy is still way too much to think about brute-forcing, but we're getting somewhere.\r\nΔt = t1 - t0 (67 bits of entropy)\r\nAnother assumption that we can reasonably make is that two sequential QueryPerformanceCounter calls will\r\nreturn similar results. Specifically, the second QueryPerformanceCounter call will always return a bit more than the first\r\none. There's no need to do a complete brute-force of both counters – we can brute-force the first one, and then\r\nguess the time that passed between the executions.\r\nUsing code as an example, instead of:\r\nfor qpc1 in range(10**7):\r\n    for qpc2 in range(10**7):\r\n        check(qpc1, qpc2)\r\nWe can do:\r\nfor qpc1 in range(10**7):\r\n    for qpc_delta in range(10**3):\r\n        check(qpc1, qpc1 + qpc_delta)\r\n10**3 was determined to be enough empirically. It should be enough in most cases, though it's just 1ms, and so\r\nit will fail in the event of a very unlucky context switch. 
Let's try though:\r\n\u003e\u003e\u003e log(10**7 * 10**3 * 10**3 * 10**5 * 256, 2)\r\n67.79470570797253\r\nWho needs precise time, anyway? (61 bits of entropy)\r\n2**67 sha256 invocations is still a lot, but it's getting manageable. For example, this is coincidentally almost\r\nexactly the current BTC hash rate. This means that if the whole BTC network were repurposed to decrypt Phobos\r\nvictims instead of pointlessly burning electricity, it would decrypt one victim per second.\r\nTime for a final observation: SystemTimeToFileTime may have a precision equal to 10 microseconds. But\r\nGetLocalTime does not:\r\nThis means that we only need to brute-force 10**3 options, instead of 10**5:\r\n\u003e\u003e\u003e log(10**7 * 10**3 * 10**3 * 10**3 * 256, 2)\r\n61.150849518197795\r\nMath time (51 bits of entropy)\r\nThere are no more obvious things to optimize. Maybe we can find a better algorithm somewhere?\r\nObserve that key[0] is equal to GetTickCount() ^ QueryPerformanceCounter().Low. A naive brute-force\r\nalgorithm will check all possible values for both components, but in most situations we can do much better. For\r\nexample, 4 ^ 0 == 5 ^ 1 == 6 ^ 2 == ... == 4. We only care about the final result, so we can ignore timer\r\nvalues that end up as the same key.\r\nA simple way to do this looks like this:\r\ndef ranges(fst, snd):\r\n    s0, s1 = fst\r\n    e0, e1 = snd\r\n    out = set()\r\n    for i in range(s0, s1 + 1):\r\n        for j in range(e0, e1 + 1):\r\n            out.add(i ^ j)\r\n    return out\r\nUnfortunately, this was quite CPU intensive (remember, we want to squeeze as much performance as possible). It\r\nturns out that there is a better recursive algorithm that avoids spending time on duplicates. 
The downside is that it's quite\r\nsubtle and not very elegant:\r\n#include \u003calgorithm\u003e\r\n#include \u003ccstdint\u003e\r\n#include \u003cvector\u003e\r\n// Returns a mask with every bit below the highest set bit of x set.\r\nuint64_t fillr(uint64_t x) {\r\n    uint64_t r = x;\r\n    while (x) {\r\n        r = x - 1;\r\n        x \u0026= r;\r\n    }\r\n    return r;\r\n}\r\nuint64_t sigma(uint64_t a, uint64_t b) {\r\n    return a | b | fillr(a \u0026 b);\r\n}\r\n// Appends to out the distinct values of a ^ b for a in [s0, e0] and b in [s1, e1].\r\nvoid merge_xors(\r\n    uint64_t s0, uint64_t e0, uint64_t s1, uint64_t e1,\r\n    int64_t bit, uint64_t prefix, std::vector\u003cuint32_t\u003e *out\r\n) {\r\n    if (bit \u003c 0) {\r\n        out-\u003epush_back(prefix);\r\n        return;\r\n    }\r\n    uint64_t mask = 1ULL \u003c\u003c bit;\r\n    uint64_t o = mask - 1ULL;\r\n    bool t0 = (s0 \u0026 mask) != (e0 \u0026 mask);\r\n    bool t1 = (s1 \u0026 mask) != (e1 \u0026 mask);\r\n    bool b0 = (s0 \u0026 mask) ? 1 : 0;\r\n    bool b1 = (s1 \u0026 mask) ? 1 : 0;\r\n    s0 \u0026= o;\r\n    e0 \u0026= o;\r\n    s1 \u0026= o;\r\n    e1 \u0026= o;\r\n    if (t0) {\r\n        if (t1) {\r\n            uint64_t mx_ac = sigma(s0 ^ o, s1 ^ o);\r\n            uint64_t mx_bd = sigma(e0, e1);\r\n            uint64_t mx_da = sigma(e1, s0 ^ o);\r\n            uint64_t mx_bc = sigma(e0, s1 ^ o);\r\n            for (uint64_t i = 0; i \u003c std::max(mx_ac, mx_bd) + 1; i++) {\r\n                out-\u003epush_back((prefix \u003c\u003c (bit+1)) + i);\r\n            }\r\n            for (uint64_t i = (1ULL \u003c\u003c bit) + std::min(mx_da^o, mx_bc^o); i \u003c (2ULL \u003c\u003c bit); i++) {\r\n                out-\u003epush_back((prefix \u003c\u003c (bit+1)) + i);\r\n            }\r\n        } else {\r\n            merge_xors(s0, mask - 1, s1, e1, bit-1, (prefix \u003c\u003c 1) ^ b1, out);\r\n            merge_xors(0, e0, s1, e1, bit-1, (prefix \u003c\u003c 1) ^ b1 ^ 1, out);\r\n        }\r\n    } else {\r\n        if (t1) {\r\n            merge_xors(s0, e0, s1, mask - 1, bit-1, (prefix \u003c\u003c 1) ^ b0, out);\r\n            merge_xors(s0, e0, 0, e1, bit-1, (prefix \u003c\u003c 1) ^ b0 ^ 1, out);\r\n        } else {\r\n            merge_xors(s0, e0, s1, e1, bit-1, (prefix \u003c\u003c 1) ^ b0 ^ b1, out);\r\n        }\r\n    }\r\n}\r\nIt's possible that there exists a simpler or faster algorithm 
for this problem, but the authors were not aware of it when\r\nworking on the decryptor. Entropy after that change:\r\n\u003e\u003e\u003e log(10**7 * 10**3 * 10**3 * 256, 2)\r\n51.18506523353571\r\nThis was our final complexity improvement. What's missing is a good implementation.\r\nGotta go fast (how fast can we go?)\r\nNaïve implementation in Python (500 keys/second)\r\nOur initial PoC written in Python tested 500 keys per second. A quick calculation shows that brute-forcing\r\n100000000000 keys will take 2314 CPU-days. Far from practical. But Python is basically the slowest kid on the\r\nblock as far as high-performance computing goes. We can do much better.\r\nUsually in situations like this we would implement a native decryptor (in C++ or Rust). But in this case, even that\r\nwould not be enough. We had to go faster.\r\nTime works WONDERS\r\nWe decided to go for maximal performance and implement our solver in CUDA.\r\nCUDA first steps (19166 keys/minute)\r\nOur first naive version was able to crack 19166 keys/minute. We had no experience with CUDA at the time, and\r\nmade many mistakes. Our GPU utilization stats looked like this:\r\nImproving sha256 (50000 keys/minute)\r\nClearly sha256 was a huge bottleneck here (not surprisingly – there are 256 times more sha256 calls than AES\r\ncalls). Most of our work here focused on simplifying the code and adapting it to the task at hand. For example, we\r\ninlined sha256_update:\r\nWe inlined sha256_init:\r\nWe replaced global arrays with local variables:\r\nWe hardcoded data size to 32 bytes:\r\nAnd made a few operations more GPU-friendly, for example used __byte_perm for bswap.\r\nIn the end, our main loop changed like this:\r\nBut that's not the end - after this optimization we realized that the code was now making a lot of unnecessary copies\r\nand data transfers. 
There are no more copies needed at this point:\r\nCombining all these improvements let us increase performance 2.5 times, to 50k keys/minute.\r\nNow make it parallel (105000 keys/minute)\r\nIt turns out graphics cards are highly parallel. Work is divided into streams, and streams are doing the logical\r\noperations. In particular, memory copies to and from the graphics card can execute at the same time as our code with no loss of\r\nperformance.\r\nJust by changing our code to use streams more effectively, we were able to double our performance to 105k\r\nkeys/minute:\r\nAnd finally, AES (818000 keys/minute)\r\nWith all these changes, we still hadn't even tried optimizing AES. After all the lessons learned previously, it was\r\nactually quite simple. We just looked for patterns that didn't work well on the GPU, and improved them. For example:\r\nWe changed a naive for loop to a manually unrolled version that worked on 32-bit integers.\r\nIt may seem insignificant, but it actually dramatically increased our throughput:\r\n...and now do it in parallel (10MM keys/minute)\r\nAt this point we were not able to make any significant performance improvements anymore. But we had one last\r\ntrick up our sleeve - we could run the same code on more than one GPU! Conveniently, we have a small GPU cluster\r\nat CERT.PL. We dedicated two machines with a total of 12 Nvidia GPUs to the task. By simple\r\nmultiplication, this immediately increased our throughput to almost 10 million keys per minute.\r\nSo brute-forcing 100000000000 keys will take just 10187 seconds (2.82 hours) on the cluster. 
Sounds great, right?\r\nWhere Did It All Go Wrong?\r\nUnfortunately, as we mentioned at the beginning, there are a lot of practical problems that we skimmed over in\r\nthis blog post, but that made publishing the decryptor tricky:\r\nKnowledge of TID and PID is required. This makes the decryptor hard to automate.\r\nWe assume very precise time measurements. Unfortunately, clock drift and intentional noise introduced\r\nto performance counters by Windows make this tricky.\r\nNot every Phobos version is vulnerable. Before deploying a costly decryptor one needs a reverse engineer\r\nto confirm the family.\r\nEven after all the improvements, the code is still too slow to run on a consumer-grade machine.\r\nVictims don't want to wait for researchers without a guarantee of success.\r\nThis is why we decided to publish this article and the source code of the (almost working) decryptor. We hope that it\r\nwill give some malware researchers a new perspective on the subject and maybe even allow them to help\r\nransomware victims decrypt their files.\r\nWe've published the CUDA source code in a GitHub repository: https://github.com/CERT-Polska/phobos-cuda-decryptor-poc. It includes short instructions, a sample config and a data set to verify the program.\r\nLet us know if you have any questions or if you were able to use the script in any way to help a ransomware\r\nvictim. You can contact us at info@cert.pl.\r\nRansomware sample analyzed: 2704e269fb5cf9a02070a0ea07d82dc9d87f2cb95e60cb71d6c6d38b01869f66 |\r\nMWDB | VT\r\nSource: https://cert.pl/en/posts/2023/02/breaking-phobos/",
	"extraction_quality": 1,
	"language": "EN",
	"sources": [
		"Malpedia"
	],
	"references": [
		"https://cert.pl/en/posts/2023/02/breaking-phobos/"
	],
	"report_names": [
		"breaking-phobos"
	],
	"threat_actors": [],
	"ts_created_at": 1775434927,
	"ts_updated_at": 1775791275,
	"ts_creation_date": 0,
	"ts_modification_date": 0,
	"files": {
		"pdf": "https://archive.orkl.eu/57bf4ecc98fa3c7899d38adea2c680dd3a582d03.pdf",
		"text": "https://archive.orkl.eu/57bf4ecc98fa3c7899d38adea2c680dd3a582d03.txt",
		"img": "https://archive.orkl.eu/57bf4ecc98fa3c7899d38adea2c680dd3a582d03.jpg"
	}
}