This post is based on research from IMC 2011; a PDF is available under my publications. Any views or opinions discussed herein are my own and are based solely on research I conducted prior to working at Twitter.
As Twitter continues to grow in popularity, so does the marketplace for abusing Twitter as a platform for spam. To understand this phenomenon, we tracked the behavior of 1.1 million accounts suspended by Twitter for disruptive activities (e.g., spamming, aggressive following) over the course of seven months. In the process, we collected a dataset of 80 million tweets sent by spam accounts, along with 37.8 million URLs presumed to lead to spam. What follows is an analysis of the abuse of online social networks through the lens of the tools, techniques, and support infrastructure spammers rely upon.
Our Dataset and All Its Caveats
Our dataset was derived from Twitter’s garden hose, which provides a sample of all tweets appearing on Twitter. More precisely, we received up to 150 tweets/second, amounting to roughly 12 million tweets per day in the absence of network outages or errors. Rather than receive generic tweets, we specifically requested tweets that contain URLs, simply because they are more interesting from a spam perspective: they have a clear monetization angle we could manually analyze. In total, we collected 1.8 billion tweets from August 2010 through March 2011, only 80 million of which turned out to be spam. Here is our daily sample size, with breaks indicating an outage in our collection (oops, measurement is hard!):
Due to rate limiting performed by Twitter, our sample rate was strictly decreasing: we were limited to 150 tweets/second, while Twitter continued to grow in volume. As a result, the fraction of all tweets with URLs that we actually received dropped from 90% at the onset of our study to 60% at its completion:
For further details on our collection methodology, validation, and sampling, check out the paper.
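To make the relationship between a fixed cap and a shrinking sample fraction concrete, here is a minimal sketch of the arithmetic. It assumes our stream was always saturated at the 150 tweets/second cap, which is a simplification, and the function name is made up for illustration.

```python
CAP_TWEETS_PER_SEC = 150  # rate limit described in this post


def implied_total_rate(sample_fraction: float,
                       cap: float = CAP_TWEETS_PER_SEC) -> float:
    """Total URL tweets/second on Twitter implied by a capped sample.

    Assumes the sample is pinned at the cap, so that
    sample_fraction = cap / total_rate.
    """
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    return cap / sample_fraction


rate_at_start = implied_total_rate(0.90)  # about 167 tweets/second
rate_at_end = implied_total_rate(0.60)    # 250 tweets/second
growth = rate_at_end / rate_at_start      # about 1.5x over the study
```

Under that assumption, the drop from 90% to 60% coverage corresponds to roughly a 1.5x increase in the volume of tweets containing URLs over the seven months.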
State of Twitter Spam – How Much?
Using our dataset, we counted the number of spam tweets sent by accounts suspended by Twitter each day from August 2010 through March 2011. The results are shown here:
Our calculations are a strict lower bound because we rely on Twitter to identify spam, a process we know is imperfect. Based on manual analysis, we estimated that Twitter caught 37% of spam, which means the actual number of spam tweets per day is likely much higher. Nevertheless, we can discern that at least half a million spam tweets are sent each day. Interestingly, the highest volume of spam preceded the holiday season; even spammers have gift suggestions for you and your family.
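As a rough illustration of how far the lower bound might sit below the true figure, the sketch below scales an observed count by the detection rate. It assumes the 37% estimate applies uniformly to spam tweets on every day, which our manual analysis does not guarantee, so treat the output as a back-of-the-envelope figure rather than a measurement.

```python
def estimate_total_spam(observed_spam: float, detection_rate: float) -> float:
    """Scale an observed spam count up by the fraction of spam that was caught.

    Assumes detection_rate is the share of all spam tweets that end up
    attributed to suspended accounts (a simplifying assumption).
    """
    if not 0 < detection_rate <= 1:
        raise ValueError("detection_rate must be in (0, 1]")
    return observed_spam / detection_rate


observed_per_day = 500_000  # lower bound quoted above
estimated_per_day = estimate_total_spam(observed_per_day, 0.37)
print(f"{estimated_per_day:,.0f} spam tweets/day")
```

With these inputs the estimate lands at roughly 1.35 million spam tweets per day.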
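One noteworthy observation is that while the total volume of spam in our sample appears to be flat, our sampling rate was decreasing over the same period. A roughly constant number of spam tweets drawn from a shrinking fraction of all tweets indicates that spam on Twitter was actually increasing over time.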
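The reasoning can be sketched numerically. The observed count and the intermediate sample fractions below are hypothetical placeholders; only the 90% and 60% endpoints come from our data, and the sketch assumes spam is sampled at the same rate as all other tweets with URLs.

```python
def scale_to_full_stream(observed_in_sample: float,
                         sample_fraction: float) -> float:
    """Estimate the count in the full stream from the count seen in a sample."""
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    return observed_in_sample / sample_fraction


# Hypothetical flat observed volume, evaluated at evenly spaced sample
# fractions between the 90% and 60% endpoints reported in this post.
observed = 500_000
fractions = [0.90, 0.80, 0.70, 0.60]
estimates = [scale_to_full_stream(observed, f) for f in fractions]

for fraction, estimate in zip(fractions, estimates):
    print(f"sample fraction {fraction:.0%}: ~{estimate:,.0f} spam tweets/day")
```

A flat observed series turns into a rising one once each point is divided by its sample fraction; between the two endpoints the implied increase is a factor of 1.5.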
Spam Accounts – How Many, How Active, How Long?
To be continued…