This post is based on research from Oakland Security & Privacy 2011 – a pdf is available under my publications
Recently we presented our research on Monarch, a real-time system that crawls URLs as they are submitted to web services and determines whether the URLs direct to spam. The system is geared towards environments such as email or social networks where messages are near-interactive and accessed within seconds after delivery.
The two major previous approaches for detecting and filtering spam include domain and IP blacklists for email and account-based heuristics in social networks which attempt to detect abusive user behavior. However, these approaches fail to protect web services. In particular, blacklists are too inaccurate and slow in listing new spam URLs. Similarly, account-based heuristics incur delays between a fraudulent account’s creation and its subsequent detection due to the need to build a history of (mis-)activity. Furthermore, these heuristics for automation fail to detect compromised accounts that exhibit a mixture of spam and benign behaviors. Given these limitations, we seek to design a system that operates in real-time to limit the period users are exposed to spam content; provides fine-grained decisions that allow services to filter individual messages posted by users; but functions in a manner generalizable to many forms of web services.
To do this, we develop a cloud-based system for crawling URLs in real-time that classifies whether a URL’s content, underlying hosting infrastructure, or page behavior exhibits spam properties. This decision can then be used by web services to either filter spam or as a signal for further analysis.
When we developed Monarch, we had six principles that influenced our architecture and approach:
- Real-time results. Social networks and email operate as near-interactive, real-time services. Thus, significant delays in filtering decisions degrade the protected service.
- Readily scalable to required throughput. We aim to provide viable classification for services such as Twitter that receive over 15 million URLs a day.
- Accurate decisions. We want the capability to emphasize low false positives in order to minimize mistaking non-spam URLs as spam.
- Fine-grained classification. The system should be capable of distinguishing between spam hosted on public services alongside non-spam content (i.e., classification of individual URLs rather than coarser-grained domain names).
- Tolerant to feature evolution. The arms-race nature of spam leads to ongoing innovation on the part of spammers’ efforts to evade detection. Thus, we require the ability to easily retrain to adapt to new features.
- Context-independent classification. If possible, decisions should not hinge on features specific to a particular service, allowing use of the classifier for different types of web services.
The architecture for Monarch consists of four components. First, messages from web services (tweets and emails in our prototype) are inserted into a dispatch Kestrel queue in a phase called URL Aggregation. These are then dequeued for Feature Collection, where a cluster of EC2 machines crawls each URL to fetch the HTML content, resolve all redirects, monitor all IP addresses contacted, and perform a number of host lookups and geolocation resolution. We optimize feature collection to include caching and whitelisting of popular benign content. These features are then stored in a database, which is later used during Feature Extraction to transform the data into meaningful binary vectors. These are then supplied to Classification. We obtain a labeled dataset from email spam traps as well as blacklists (our only means of obtaining a ground truth set of spam on Twitter). Using a distributed logistic regression with L1-regularization, which we detail in the paper, we are able to reduce from 50 million features down to 100,000 of the most meaningful features and build a model of spam in 45 minutes for 1 million samples. During live operation, we simply use this model to classify the features of a URL. Overall, it takes roughly 6 seconds from insertion into the dispatch queue to obtain a final decision for whether a URL is spam, with network delay accounting for the majority of overhead.
- Training on both email and tweets, we are able to generate a unified model that correctly classifies 91% of samples, with 0.87% false positives and 17.6% false negatives.
- Throughput of the system is 638,000 URLs/day when running on 20 EC2 instances.
- Decision time for a single URL is ~6 seconds
One of the unexpected results is that Twitter spam appears to be independent from email spam, with different campaigns occurring in both services simultaneously. This seems to indicate the actors targeting email haven’t modified their infrastructure to attack Twitter yet, though this may change over time.
There remain a number of challenges in running a system like Monarch that are discussed in the paper as well as pointed out by other researchers.
- Feature Evasion: Spammers can attempt to game the machine learning system. Given the real-time feedback for whether a URL is spam, they can attempt to modify their content or hosting to avoid detection.
- Time-based Evasion: URLs are crawled immediately upon their submission to the dispatch queue. This creates a time of click, time of use challenge where spammers can present benign content upon sending an email/tweet, but then change the content to spam after the URL is cleared.
- Crawler Evasion: Given we operate on a limited IP space and use single browser type, attackers can fingerprint both our hosting and browser client. They can then redirect our crawlers to benign content, while sending legitimate visitors to hostile content.
- Side effects: Not all websites adhere to the standard that GET requests should have no side effects. In particular, subscribe and unsubscribe URLs as well as advertisements may have side effects introduced by our crawler.
Other interesting questions also remain to be answered. In particular, it would be useful to understand how accuracy performs over time on a per campaign basis. Some campaigns may last a long time, increasing our overall accuracy, while quickly churning campaigns that introduce new features may result in lower accuracy. Similarly, it would be useful to understand whether the features we identify appear in all campaigns (and are long lasting), or whether we are able to quickly adapt to the introduction of new features and new campaigns.