{
	"id": "fd361e59-9584-411d-8b9f-efab92b10bc0",
	"created_at": "2026-04-06T00:09:36.575254Z",
	"updated_at": "2026-04-10T13:12:11.831895Z",
	"deleted_at": null,
	"sha1_hash": "8daee59956b47058079d7d55319b5a3f400f708b",
	"title": "Robots.txt tells hackers the places you don't want them to look",
	"llm_title": "",
	"authors": "",
	"file_creation_date": "0001-01-01T00:00:00Z",
	"file_modification_date": "0001-01-01T00:00:00Z",
	"file_size": 33093,
	"plain_text": "Robots.txt tells hackers the places you don't want them to look\r\nBy Darren Pauli\r\nPublished: 2015-05-19 · Archived: 2026-04-02 10:44:47 UTC\r\nMelbourne penetration tester Thiebaud Weksteen is warning system administrators that robots.txt files can give attackers valuable information on potential targets by offering clues about the directories their owners are trying to protect.\r\nRobots.txt files tell search engines which directories on a web server they can and cannot read.\r\nWeksteen, a former Securus Global hacker, thinks they offer clues about where system administrators store sensitive assets, because the mention of a directory in a robots.txt file signals that the owner has something they want to hide.\r\n\"In the simplest cases, it (robots.txt) will reveal restricted paths and the technology used by your servers,\" Weksteen says.\r\n\"From a defender perspective, two common fallacies remain; that robots.txt somewhat is acting as an access control mechanism [and that] content will only be read by search engines and not by humans.\"\r\nAdministration portals, which more often than not contain vulnerabilities and poor security, are regularly included in robots.txt disallow lists in a bid to obscure those assets.\r\nIdentifying those portals is standard practice for penetration testers, who will, as Weksteen does, compile and update detailed lists of subdirectories by harvesting robots.txt files.\r\nThose lists help speed up the discovery of sensitive assets in future attacks or penetration tests.\r\nHere's how Weksteen says things will go down:\r\n\"During the reconnaissance stage of a web application test, the tester usually uses a list of known subdirectories to brute force the server and find hidden resources. Depending on the uptake of certain web technologies, it needs to be refreshed on a regular basis. As you may see, the directive disallow gives an attacker precious knowledge on what may be worth looking at. Additionally, if that is true for one site, it is worth checking for another.\"\r\nWeksteen offers security bods his method for collecting his subdirectory list, along with techniques to clean and verify the initially large datasets. He whittled some 59,558 sites down to 35,375 which contain robots.txt files.\r\nIn total it requires fewer than 100 lines of scripting to do this kind of scraping, though it could benefit from optimisation of the algorithms used.\r\nThe penetration tester gives examples of exposed assets, including some 10,000 unclassified documents directly listed in the robots.txt file for the Israeli Assembly.\r\nThe same keyword generated a string of unclassified assets blocked by the US Department of State but still accessible via the Internet Archive.\r\nReddit users applied Weksteen's work and discovered the identity of a female student who had seemingly been stalked. Her name, since redacted, was listed in the file description of an image that the robots.txt file disallowed from indexing.\r\nAdmins would do best to exclude assets using general terms rather than absolute references.\r\nSome more entrepreneurial tech bods say they set up honeypots under fake assets marked as disallowed in robots.txt, banning all IPs which request the resource. ®\r\nSource: https://www.theregister.com/2015/05/19/robotstxt/",
	"extraction_quality": 1,
	"language": "EN",
	"sources": [
		"MITRE"
	],
	"origins": [
		"web"
	],
	"references": [
		"https://www.theregister.com/2015/05/19/robotstxt/"
	],
	"report_names": [
		"robotstxt"
	],
	"threat_actors": [],
	"ts_created_at": 1775434176,
	"ts_updated_at": 1775826731,
	"ts_creation_date": 0,
	"ts_modification_date": 0,
	"files": {
		"pdf": "https://archive.orkl.eu/8daee59956b47058079d7d55319b5a3f400f708b.pdf",
		"text": "https://archive.orkl.eu/8daee59956b47058079d7d55319b5a3f400f708b.txt",
		"img": "https://archive.orkl.eu/8daee59956b47058079d7d55319b5a3f400f708b.jpg"
	}
}