Reddit is a popular platform for users to share honest and candid opinions about almost everything you can think of. Many users today do Google searches with “site:reddit.com” appended to the end of the search query just to get genuine answers from real humans (and sometimes bots 😉). Many posts on Reddit have links within the post’s content pointing to a source, website, or document, and my curiosity led me to wonder - are all these links safe? Are some of them potentially harmful or malicious?
TL;DR: I scanned a few subreddits to see how many of the domains linked in Reddit posts are malicious. Check out the Colab notebook if you want to try it out.
Enter scrappy Colab notebooks and Pangea APIs
I decided to go on a quick data analysis adventure, so I spun up a Colab notebook. Using PRAW (the famous Python package for interacting with the Reddit API), I scraped posts off subreddits that would likely have user-generated links in post content.
My plan was to build a link scanner and run it on two kinds of subreddits: ones where people knowingly post links to scams and malware, and ones where people post ordinary user-generated links that are not meant to be malicious.
I chose r/CryptoScams for the first group, since its posts intentionally include links to scams and malware. For the second group I chose r/couponing and r/bitcoin, where user-generated links may unintentionally lead users to scams or malware.
How did I run the analysis?
I generated a pair of Reddit API keys through the developer portal, then used the PRAW library to fetch posts from any desired subreddit. Then I extracted all the URLs from every post’s selftext field (AKA: what Reddit calls post content). Next, I extracted the domains from the URLs and passed them through Pangea’s Domain Intel API, which is backed by industry-leading datasets from DomainTools.
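Here is a rough sketch of the fetch-and-extract steps, not the exact notebook code. The function names, the URL regex, and the user agent string are my own illustrative assumptions, and it keeps the full hostname (minus a leading "www.") rather than reducing it to a registered domain.

```python
import re
from urllib.parse import urlparse

# Simple pattern for http(s) links; stops at whitespace and closing brackets
URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")


def fetch_selftexts(subreddit_name, client_id, client_secret, limit=1000):
    """Yield the selftext of the newest posts in a subreddit (needs PRAW and network access)."""
    import praw  # imported lazily so the helpers below work without PRAW installed

    reddit = praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent="link-scanner-sketch",  # hypothetical user agent string
    )
    for submission in reddit.subreddit(subreddit_name).new(limit=limit):
        yield submission.selftext


def extract_urls(text):
    """Return the http(s) URLs found in a post body, minus trailing punctuation."""
    return [match.rstrip(".,;:!?\"'") for match in URL_PATTERN.findall(text)]


def extract_domains(urls):
    """Return the set of unique hostnames for a list of URLs."""
    domains = set()
    for url in urls:
        try:
            host = urlparse(url).hostname
        except ValueError:  # malformed URL, e.g. unbalanced brackets
            continue
        if host:
            domains.add(host[4:] if host.startswith("www.") else host)
    return domains
```

Collecting the hostnames in a set means each domain is only sent to the Domain Intel API once, no matter how many posts link to it.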
Results: Drum roll please 🥁
Note: I scanned the unique domains of the links and reported the number of malicious domains found. For further analysis, Pangea’s URL Intel API could also be used to scan complete URLs.
No surprise, r/CryptoScams had the highest share of malicious domains. I expected this, since people there are reporting scams and posting links to them.
Across all of the links in r/CryptoScams, good and bad, Pangea’s Domain Intel API identified 67% of the associated domains as malicious.
Next up, in r/bitcoin, 28.5% of the domains associated with the scanned links were found to be malicious.
Finally, in r/couponing, only 10.5% of the domains associated with the scanned links were found to be malicious.
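For reference, the percentages are just the malicious domains divided by the unique domains scanned. A minimal sketch of that arithmetic, assuming each result is a [domain, verdict, category, risk_score] list with the verdict string "malicious" for flagged domains, and None for a failed lookup. The function name and the sample rows are made up for illustration and are not real scan results.

```python
def malicious_share(rows):
    """Percentage of successfully scanned domains whose verdict is 'malicious'."""
    scanned = [row for row in rows if row is not None]  # skip failed lookups
    if not scanned:
        return 0.0
    malicious = sum(1 for row in scanned if row[1] == "malicious")
    return round(100 * malicious / len(scanned), 1)


# Hypothetical rows for illustration only, not real scan results
sample_rows = [
    ["good-one.test", "benign", [], 0],
    ["good-two.test", "benign", [], 5],
    ["bad-one.test", "malicious", ["phishing"], 100],
    ["unknown.test", "unknown", [], 0],
    None,  # a lookup that failed
]

print(malicious_share(sample_rows))  # 25.0
```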
Note: All these tests were done with the 1,000 newest posts on each subreddit. I believe that running this across the top posts might turn up fewer malicious domains, since top posts have numerous upvotes, implying that they’ve been vetted by users. If you’re interested in testing this out for yourself, I encourage you to fork the Colab notebook 😅.
Coolio, but what can I do about it?
When I refer to malicious domains, I mean domains that Pangea’s Domain Intel API classified as malicious based on past content or reports, which are collected from various sources and research by DomainTools and updated on a regular basis.
The applications of the Domain Intel API are endless! As you saw, I used it to conduct the data analysis across subreddits, but you could use the same API in Discord bots to scan messages for malicious links and domains, or bake it into your apps so that your users don’t accidentally click on links that could potentially contain malware.
Best part of all, you can bake Pangea’s Intel APIs into your app with just a few lines of code. For example, this was the function I wrote to check if a domain was malicious… easy ain’t it? 🤨
from pangea.config import PangeaConfig
from pangea.services import DomainIntel

config = PangeaConfig(domain=PANGEA_DOMAIN)
intel = DomainIntel(PANGEA_TOKEN, config=config)

def check_pangea_domain(domain: str):
    try:
        response = intel.reputation(domain=domain, provider="domaintools", verbose=False, raw=True)
        # [domain, verdict, category, raw DomainTools risk score]
        row = [domain, response.result.data.verdict, response.result.data.category, response.result.raw_data["response"]["risk_score"]]
        print(row)
        return row
    except Exception as e:  # handler assumed here: log the failure and return None
        print(f"Lookup failed for {domain}: {e}")
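To make the Discord bot idea a bit more concrete, here is a rough sketch of a message scanner. It is not tied to any bot framework: it just takes the message text plus a lookup function such as check_pangea_domain. The name flag_malicious_domains, the URL regex, and the assumption that flagged domains carry the verdict string "malicious" are mine, so treat it as a starting point rather than a finished implementation.

```python
import re
from urllib.parse import urlparse


def flag_malicious_domains(message, lookup):
    """Return the sorted hostnames in a message that the lookup marks as malicious.

    lookup is any callable that takes a hostname and returns
    [domain, verdict, category, risk_score], or None if the lookup failed.
    """
    hosts = set()
    for url in re.findall(r"https?://[^\s)\]>]+", message):
        try:
            host = urlparse(url).hostname
        except ValueError:  # skip malformed URLs
            continue
        if host:
            hosts.add(host.rstrip("."))  # drop a trailing dot from sentence punctuation

    flagged = []
    for host in sorted(hosts):
        row = lookup(host)
        if row is not None and row[1] == "malicious":
            flagged.append(host)
    return flagged


# Stand-in lookup for illustration; in a real bot you would pass check_pangea_domain
def fake_lookup(domain):
    verdict = "malicious" if domain == "bad.test" else "benign"
    return [domain, verdict, [], 0]


print(flag_malicious_domains("Free coins at http://bad.test/win, docs at https://good.test/docs", fake_lookup))
```

A bot could call this on every incoming message and warn the channel or delete the message when the returned list is not empty.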
If you’d like to try this demo out yourself on different subreddits, fork my Colab notebook and follow the setup instructions.
If you have creative ideas about baking Domain and URL Intel into your apps, I’d love to hear them. You can find me on X / Twitter @snpranav.
Happy scanning 🙌