Email spam has been around since the early days of the internet. Because sending it costs almost nothing, its volume has grown dramatically. In 2011, approximately 7 trillion spam emails were sent worldwide, accounting for 85% of global email traffic. According to Nucleus Research Inc., spam management cost U.S. businesses an estimated USD 71 billion in lost productivity. Reliable spam filtering systems are what keep our inboxes from being flooded with spam.
Spam filtering approaches fall into two broad categories: using computer techniques to detect spam, and building a statistical spam filtering model*. For now, let’s focus on the latter. First, collect a sizable number of normal and spam emails. Next, use text mining to identify the features that distinguish the two types, e.g. the frequency of specified words and symbols, the ratio of symbols to words, sentence length, and the use of upper- and lower-case letters (in languages such as English). Then build a model on these features; for example, if the word qian “錢” (money) occurs more than a certain number of times, the email’s spam probability increases. To decide whether an incoming email is spam, the model scores it and compares the resulting probability with a preset threshold.
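The feature-and-score idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the watch list, the symbol set, the weights, the bias, and the 0.5 threshold are all hypothetical values chosen by hand, and the logistic scoring function is one simple stand-in for whatever statistical model a real filter would fit to a labeled collection of normal and spam emails.

```python
import math
import re

# Hypothetical watch list and symbol set; a real filter would derive
# these from a labeled corpus of normal and spam emails.
WATCH_WORDS = ("money", "free", "winner")
SYMBOLS = "!$%*#@"


def extract_features(text):
    """Compute the kinds of features listed above from raw email text."""
    words = re.findall(r"[A-Za-z']+", text)
    lowered = [w.lower() for w in words]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    letters = [c for c in text if c.isalpha()]
    n_words = max(len(words), 1)  # avoid division by zero on empty input
    return {
        # frequency of the specified words, per word of text
        "watch_word_rate": sum(lowered.count(w) for w in WATCH_WORDS) / n_words,
        # ratio of symbols to words
        "symbol_ratio": sum(text.count(s) for s in SYMBOLS) / n_words,
        # average sentence length, in words
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        # share of letters that are upper case
        "upper_ratio": sum(c.isupper() for c in letters) / max(len(letters), 1),
    }


# Illustrative weights only (assumption): they are NOT fitted to any data.
WEIGHTS = {
    "watch_word_rate": 8.0,
    "symbol_ratio": 4.0,
    "avg_sentence_len": -0.05,
    "upper_ratio": 3.0,
}
BIAS = -2.0
THRESHOLD = 0.5  # preset cut-off (assumption)


def spam_probability(text):
    """Map the weighted feature sum to a value between 0 and 1."""
    features = extract_features(text)
    z = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def is_spam(text, threshold=THRESHOLD):
    """Flag the email when its score reaches the threshold."""
    return spam_probability(text) >= threshold


if __name__ == "__main__":
    for sample in (
        "FREE MONEY!!! You are a WINNER! Claim your free money now $$$",
        "Hi Anna, the meeting is moved to Thursday at ten. Thanks, Ben",
    ):
        print(round(spam_probability(sample), 3), is_spam(sample))
```

In practice the weights would be estimated from the collected emails rather than set by hand, and the threshold would be tuned to balance missed spam against legitimate mail wrongly flagged.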