Spam filtering involves predicting whether a message is unwanted spam (e.g. unsolicited promotions) or a message of interest (referred to in this analysis as ‘ham’). Such filters are most familiar from email systems, where spam messages are automatically diverted to a spam folder, but the same procedure can be applied to other areas. For example, Apple Inc. could use filtering to analyse tweets: it could determine whether a tweet containing ‘apple’ is about the company, and therefore of interest, or about the fruit (as in apples and pears), and therefore not. The probability that a message containing the word ‘apple’ refers to Apple Inc. is higher if it also includes words like ‘iPhone’ and ‘computer’ than if it includes words like ‘strawberries’ and ‘farmers’. Likewise, the probability that a message is spam is higher if it includes certain words. These words are identified by analysing a dataset of messages that have already been classified as spam or ham.
Naive Bayes
Naive Bayes calculates the probability that a message is spam given that it includes certain words. It does this by assuming independence of events, i.e. the probability of word A occurring in a spam message is independent of the probability of word B occurring. This is an oversimplification, which is why the classifier is called ‘naive’. In reality, the probability of word A occurring in a spam message may change if we already know that word B occurs in that message; accounting for this would be computationally expensive. Naive Bayes performs well despite its simplicity, and is therefore often used as a quick text classification model.
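As a rough sketch of the calculation, the spam probability for a message containing two words can be computed by hand in base R. The prior and the per-word conditional probabilities below are made-up illustrative values, not estimates from the SMS dataset used later.

```r
# Toy Naive Bayes calculation with made-up probabilities (illustration only)
p_spam <- 0.13                      # assumed prior: share of spam messages
p_ham  <- 1 - p_spam

# Assumed probability of each word appearing, given the class
p_word_given_spam <- c(free = 0.30, call = 0.40)
p_word_given_ham  <- c(free = 0.02, call = 0.10)

# 'Naive' independence assumption: multiply the per-word probabilities
score_spam <- p_spam * prod(p_word_given_spam)
score_ham  <- p_ham  * prod(p_word_given_ham)

# Normalise to get P(spam | message contains 'free' and 'call')
p_spam_given_words <- score_spam / (score_spam + score_ham)
round(p_spam_given_words, 3)
## 0.9
```

Even with a low prior, two strongly spam-associated words push the posterior probability close to 1, which is exactly the effect the classifier relies on.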
SMS Spam Filtering
This page demonstrates an example of spam filtering using Naive Bayes in R. The dataset includes 5,559 SMS messages and can be accessed here. The procedure follows the example given in Machine Learning with R by Brett Lantz.
Installing, Loading and Reading
It is first necessary to install and load the required packages. The ‘wordcloud’ package is used to display the most frequent words in each class in a nice visualisation. The ‘tm’ package with ‘SnowballC’ is used to clean and organise the data. ‘e1071’ is used for the Naive Bayes classifier, and ‘gmodels’ is used to display a confusion matrix for accuracy statistics.
#install.packages('tm')
#install.packages('wordcloud')
#install.packages('e1071')
#install.packages('gmodels')
#install.packages('SnowballC')
library(tm)
library(wordcloud)
library(e1071)
library(gmodels)
library(SnowballC)

The csv is read into R (assuming it is in the working directory). The type column indicates whether a message is spam or ham. It must be converted to a factor so that it is treated as a categorical variable. The table shows the number of ham and spam messages in the dataset.
spam <- read.csv('sms_spam.csv')
spam$type <- factor(spam$type)
table(spam$type)

##
## ham spam
## 4827 747
We can visualise the frequent words in each category using a wordcloud. Words such as ‘free’, ‘call’ and ‘now’ appear often in the spam category.
spam_messages <- subset(spam,type=="spam")
ham_messages <- subset(spam, type=="ham")
wordcloud(spam_messages$text, max.words = 100, scale = c(3,0.5))
wordcloud(ham_messages$text, max.words = 100, scale = c(3,0.5))

Data Preparation
In preparation for statistical modelling the dataset needs to be cleaned. The ‘tm’ package can do this by converting the messages into a collection of text documents known as a ‘corpus’. This is then used to create a ‘document term matrix’ in which the rows represent the messages and the columns represent the words. Words are converted to lower case, and numbers and punctuation are removed. Stemming is also performed. This removes the suffix from words, making analysis easier as it combines words with similar meanings: ‘calling’, ‘calls’ and ‘called’, for example, would all be reduced to the stem ‘call’.
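To see what these transformations do, the same steps can be sketched on a single made-up message in base R (the tm control list below applies the equivalent cleaning across the whole corpus; stemming is omitted here as it requires SnowballC):

```r
# Illustrative cleaning of one made-up message in base R
msg <- "FREE entry!! Call 08001234567 now."

msg <- tolower(msg)                    # convert to lower case
msg <- gsub("[0-9]+", "", msg)         # remove numbers
msg <- gsub("[[:punct:]]+", "", msg)   # remove punctuation
msg <- gsub("\\s+", " ", trimws(msg))  # tidy up leftover whitespace
msg
## "free entry call now"
```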
corpus <- VCorpus(VectorSource(spam$text))
dtm <- DocumentTermMatrix(corpus, control = list(
tolower = TRUE,
removeNumbers = TRUE,
removePunctuation = TRUE,
stemming = TRUE
))

Data Partitioning and Further Cleaning
The dataset is split for training and testing: 75% of the dataset is used to train and build the model, which is then tested on the remaining 25%. The class proportions are checked to ensure both datasets contain a similar share of each category; around 13% of the messages in both the training and test sets are spam. The partitioning is then performed.
trainLabels <-spam[1:4169,]$type
testLabels <- spam[4170:5559,]$type
prop.table(table(trainLabels))

## trainLabels
## ham spam
## 0.8647158 0.1352842
prop.table(table(testLabels))

## testLabels
## ham spam
## 0.8697842 0.1302158
dtmTrain <- dtm[1:4169,]
dtmTest <- dtm[4170:5559,]

Low-frequency words are removed as they are unlikely to be useful in the model. Only words that occur at least five times in the training data are kept.
freqWords <- findFreqTerms(dtmTrain,5)
freqTrain <- dtmTrain[,freqWords]
freqTest <- dtmTest[,freqWords]

The document term matrix stores the number of times each word occurs in each message, but the Naive Bayes classifier works with categorical features. Counts are therefore converted to ‘Yes’ (the word occurs in the message) or ‘No’ (it does not). The conversion is applied to every column (i.e. MARGIN = 2).
convert_counts <- function(x) {
x <- ifelse(x > 0, "Yes", "No")
}
train <- apply(freqTrain, MARGIN = 2, convert_counts)
test <- apply(freqTest, MARGIN = 2, convert_counts)

The resulting matrix should look like this:
Training and Testing
The Naive Bayes classifier is built using the ‘e1071’ package. As an example, the conditional probabilities for the word ‘call’ are displayed: the probability that a message contains the word ‘call’ is much higher for spam than for ham.
classifier <- naiveBayes(train, trainLabels)
classifier[2]$tables$call

## call
## trainLabels No Yes
## ham 0.94368932 0.05631068
## spam 0.56382979 0.43617021
To evaluate the performance of the classifier, it is used to predict the class of the messages in the test set. Its accuracy can be seen in the confusion matrix: 9 + 23 = 32 of the 1,390 test messages were wrongly classified.
testPredict <- predict(classifier, test)
CrossTable(testPredict, testLabels,
prop.chisq = FALSE, prop.t = FALSE,
dnn = c('predicted', 'actual'))

##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 1390
##
##
## | actual
## predicted | ham | spam | Row Total |
## -------------|-----------|-----------|-----------|
## ham | 1200 | 23 | 1223 |
## | 0.981 | 0.019 | 0.880 |
## | 0.993 | 0.127 | |
## -------------|-----------|-----------|-----------|
## spam | 9 | 158 | 167 |
## | 0.054 | 0.946 | 0.120 |
## | 0.007 | 0.873 | |
## -------------|-----------|-----------|-----------|
## Column Total | 1209 | 181 | 1390 |
## | 0.870 | 0.130 | |
## -------------|-----------|-----------|-----------|
##
##
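The headline accuracy can be checked directly from the counts in the table above:

```r
# Counts taken from the CrossTable output above
ham_as_ham   <- 1200   # ham correctly predicted as ham
spam_as_spam <- 158    # spam correctly predicted as spam
ham_as_spam  <- 9      # ham wrongly predicted as spam
spam_as_ham  <- 23     # spam wrongly predicted as ham

total    <- ham_as_ham + spam_as_spam + ham_as_spam + spam_as_ham  # 1390
accuracy <- (ham_as_ham + spam_as_spam) / total

round(accuracy, 3)
## 0.977
```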
Final Words
32 wrong out of 1,390 is good, but may not be good enough for everyone. The odd message wrongly classified as spam may be very important. In such situations more sophisticated models are needed; the sender of the message, and whether the receiver reads similar messages, could also be taken into account when determining spam. Such a simple model as shown here is therefore not ideal for classifying important messages, but may be useful in less vital text classification. It may not be much of a problem, for example, if the odd tweet about apples and pears ends up in the Apple Inc. database.