Protecting Against Dark Web Dumps
So with my child reaching 4 months, I've only had scattered moments for things like revision, so instead I've been working on a small project that attempts to pick up and detect mentions of my current employer on the dark web.
The script works well and, while it has some limitations (which I will cover later), it does a remarkable job at what it sets out to do. Let me talk you through it.
The principle
The project involves a Python script that pulls the listings for several likely phrases from the search engine ahmia.fi, then recursively crawls through the onion links it finds until it reaches a certain depth. At each site it visits it reviews the HTML, checks for a keyword match and records it if a match is found. A rough sketch of that seeding step is below.
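To make that concrete, here's a minimal sketch of what the seeding step can look like. It assumes the default Tor SOCKS port (9050) used by the Expert Bundle, requests installed with SOCKS support (requests[socks]), and a couple of placeholder search phrases; the real script's queries and parsing differ.

```python
import requests
from bs4 import BeautifulSoup

# All traffic goes through the local Tor SOCKS proxy (Expert Bundle default).
# socks5h:// ensures DNS resolution also happens inside Tor.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

SEARCH_PHRASES = ["example corp", "examplecorp.com"]  # placeholder queries

def seed_onion_links(phrase: str) -> set[str]:
    """Search ahmia.fi for a phrase and collect any .onion links on the results page."""
    resp = requests.get(
        "https://ahmia.fi/search/",
        params={"q": phrase},
        proxies=TOR_PROXIES,
        timeout=60,
    )
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    return {a["href"] for a in soup.find_all("a", href=True) if ".onion" in a["href"]}

if __name__ == "__main__":
    for phrase in SEARCH_PHRASES:
        print(phrase, len(seed_onion_links(phrase)), "links found")
```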
As for the reason it was created: the aim going in was to pick up leaked data dumps involving my organisation, whether the data originated from us or from another org.
Since I'm writing this article after the fact, I won't go through the construction process, but here's a breakdown of the features.
Process
- Checks for the Tor proxy and gracefully exits if it isn't running (in this instance we're using the Tor Expert Bundle for a headless experience).
- Runs through a prepopulated list of search queries and returns the resulting onion addresses.
- Begins recursively crawling, keeping the results in memory (a stripped-down sketch of this loop follows the list).
- Sites featuring certain keywords (you can guess…) are dropped and their links are not collected.
- Crawl wait times and max depth are set via environment variables.
- Crawled pages have their HTML scanned for keywords, and any further onion links found are added to the queue as the next layer.
- Results are collated and saved.
- Crawled links are saved to visited_links_cache.json, which records the URL, date crawled and other metadata.
- If a keyword match is found, a snippet of the surrounding HTML is taken and saved alongside the metadata to a separate file, keyword_matches.json, which records the URL and the text of the area where the keyword was detected.
- At the end of the script, a connection is made to Zoho (which acts as a mail relay) and the results are emailed out (see the mail sketch after the list).
- A crontab task is set up to scan daily.
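To illustrate the crawl, cache and keyword-match steps, here is a stripped-down, single-threaded sketch of the kind of loop involved. The output file names match the ones mentioned above, but the environment variable names, keyword list, blocklist and snippet logic are placeholders rather than the script's actual code.

```python
import json
import os
import re
import time
from datetime import datetime, timezone

import requests
from bs4 import BeautifulSoup

TOR_PROXIES = {"http": "socks5h://127.0.0.1:9050", "https": "socks5h://127.0.0.1:9050"}

# Placeholder environment variable names for the configurable bits.
MAX_DEPTH = int(os.environ.get("CRAWL_MAX_DEPTH", "2"))
WAIT_SECONDS = float(os.environ.get("CRAWL_WAIT_SECONDS", "5"))

KEYWORDS = ["example corp"]   # phrases we're hunting for (placeholder)
BLOCKLIST = ["placeholder"]   # sites containing these terms are dropped outright

ONION_LINK = re.compile(r"https?://[a-z2-7]{16,56}\.onion\S*", re.IGNORECASE)

def tor_is_running() -> bool:
    """Bail out early if the Tor proxy isn't listening."""
    try:
        requests.get("https://check.torproject.org", proxies=TOR_PROXIES, timeout=30)
        return True
    except requests.RequestException:
        return False

def crawl(seeds):
    visited, matches = {}, []
    queue = [(url, 0) for url in seeds]

    while queue:
        url, depth = queue.pop(0)
        if url in visited or depth > MAX_DEPTH:
            continue
        try:
            html = requests.get(url, proxies=TOR_PROXIES, timeout=60).text
        except requests.RequestException:
            continue

        visited[url] = {"crawled_at": datetime.now(timezone.utc).isoformat(), "depth": depth}
        text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

        # Drop unwanted sites entirely: no keyword check, no link harvesting.
        if any(bad in text.lower() for bad in BLOCKLIST):
            continue

        for keyword in KEYWORDS:
            idx = text.lower().find(keyword.lower())
            if idx != -1:
                matches.append({"url": url, "keyword": keyword,
                                "snippet": text[max(0, idx - 100): idx + 100]})

        # Newly discovered onions become the next layer of the crawl.
        queue.extend((link, depth + 1) for link in ONION_LINK.findall(html))
        time.sleep(WAIT_SECONDS)  # built-in wait to avoid hammering the circuit

    with open("visited_links_cache.json", "w") as f:
        json.dump(visited, f, indent=2)
    with open("keyword_matches.json", "w") as f:
        json.dump(matches, f, indent=2)

if __name__ == "__main__":
    if not tor_is_running():
        raise SystemExit("Tor proxy not reachable; start the Tor Expert Bundle first.")
    crawl(["http://exampleonionaddress.onion"])  # placeholder seed from the search step
```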
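And a sketch of the final reporting step. The SMTP host and port here are the standard ones for Zoho Mail (your region or plan may differ), and the sender, recipient and credential environment variable names are all placeholders.

```python
import json
import os
import smtplib
from email.message import EmailMessage

def email_results(matches_path="keyword_matches.json"):
    """Mail the day's keyword matches through Zoho, which acts as the relay."""
    with open(matches_path) as f:
        matches = json.load(f)

    msg = EmailMessage()
    msg["Subject"] = f"Dark web crawl: {len(matches)} keyword match(es)"
    msg["From"] = "crawler@example.com"   # placeholder sender
    msg["To"] = "secops@example.com"      # placeholder recipient
    msg.set_content(json.dumps(matches, indent=2) if matches else "No matches today.")

    # Credentials come from the environment; an app-specific password works well here.
    with smtplib.SMTP_SSL("smtp.zoho.com", 465) as smtp:
        smtp.login(os.environ["ZOHO_USER"], os.environ["ZOHO_APP_PASSWORD"])
        smtp.send_message(msg)

if __name__ == "__main__":
    email_results()

# The daily run is then just a crontab entry along the lines of:
#   0 6 * * * /usr/bin/python3 /opt/darkweb-crawler/crawler.py
```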

Limitations
- As this script uses requests, it cannot render sites that rely on JavaScript.
- While it picks up a large proportion of the dark web, well-hidden sites may not be indexed or referred to by other sites.
- Measures such as login-to-view content can prevent keyword matching.
Roadmap
- Implement JS detection and swap to Selenium for those cases.
- Implement categorisation of sites and set up more intensive crawling / fingerprinting of sites such as hacking forums.
- Implement multiple Tor relays to speed up scraping (multi-threading is already implemented, so speed is primarily limited by node speed and the built-in waits that avoid overloading the exit node).
- Implement lists of known onion links (GitHub, etc.) as a basis to seed the crawl.
Conclusion
Overall this has been my favourite project in a while. I've always specialised in crawlers when it comes to tooling, and this one has been right up my alley. That, and bringing in GitHub Copilot to assist with the crawling has significantly reduced the coding time, allowing the project to be built in weeks rather than months.