Building a Spam Filter with Naive Bayes

In this project, we’re going to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. Our goal is to write a program that classifies new messages with an accuracy greater than 80% — so we expect that more than 80% of the new messages will be classified correctly as spam or ham (non-spam).

To train the algorithm, we'll use a dataset of 5,572 SMS messages that have already been classified by humans. The dataset was put together by Tiago A. Almeida and José María Gómez Hidalgo, and it can be downloaded from the UCI Machine Learning Repository. The data collection process is described in more detail on this page, where you can also find some of the papers authored by Tiago A. Almeida and José María Gómez Hidalgo.

Exploring the Dataset

We’ll now start by reading in the dataset.

import pandas as pd

spam_dat = pd.read_csv('SMSSpamCollection', sep = '\t', header = None, names = ['Label', 'SMS'])
spam_dat.head()
Label SMS
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
print(spam_dat.Label.value_counts(normalize = True) * 100)
print(spam_dat.shape)
ham     86.593683
spam    13.406317
Name: Label, dtype: float64
(5572, 2)

Above, we see that about 87% of the messages are ham, and the remaining 13% are spam. This sample looks representative, since in practice most messages that people receive are ham. The imbalance also sets a useful baseline to keep in mind: a classifier that always predicts ham would already be about 87% accurate.

Training and Test Set

We’re now going to split our dataset into a training and a test set, where the training set accounts for 80% of the data, and the test set for the remaining 20%.

# Randomize the entire dataset, then split it 80/20
rand_spam = spam_dat.sample(frac = 1, random_state = 1)
split_index = round(0.8 * len(rand_spam))

train = rand_spam.iloc[:split_index, :].copy()
test = rand_spam.iloc[split_index:, :].copy()

print('-----------------------------------------')
print('Train')
print('-----------------------------------------')
print(train.Label.value_counts(normalize = True) * 100)
print('-----------------------------------------')
print('Test')
print('-----------------------------------------')
print(test.Label.value_counts(normalize = True) * 100)
-----------------------------------------
Train
-----------------------------------------
ham     86.54105
spam    13.45895
Name: Label, dtype: float64
-----------------------------------------
Test
-----------------------------------------
ham     86.804309
spam    13.195691
Name: Label, dtype: float64

The percentages of spam and ham in the training and test sets are close to those in the full dataset, where about 87% of the messages are ham and the remaining 13% are spam, so both sets remain representative of the whole.

Data Cleaning

To calculate all the probabilities required by the algorithm, we'll first need to do a bit of data cleaning to bring the data into a format that lets us easily extract all the information we need.

Essentially, we want to bring the data into a format where each row represents one message, the first column holds the label, and every remaining column counts how many times one vocabulary word occurs in that message.
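
As a quick sketch of that shape, here is a made-up two-message example (not part of the dataset):

toy = pd.DataFrame({
    'Label': ['ham', 'spam'],
    'SMS': [['see', 'you', 'tomorrow'], ['win', 'a', 'prize', 'win']]
})

# One count column per word in the toy vocabulary
for word in ['a', 'prize', 'see', 'tomorrow', 'win', 'you']:
    toy[word] = toy['SMS'].apply(lambda sms: sms.count(word))

In the spam row, the win column gets the value 2 because that word occurs twice in the message.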

Letter Case and Punctuation

We’ll begin with removing all the punctuation and bringing every letter to lower case.

train['SMS'] = train['SMS'].str.replace(r'\W', ' ', regex = True)
train['SMS'] = train['SMS'].str.lower()
train.head()
Label SMS
1078 ham yep by the pretty sculpture
4028 ham yes princess are you going to make me moan
958 ham welp apparently he retired
4642 ham havent
4674 ham i forgot 2 ask ü all smth there s a card on ...

Creating the Vocabulary

Let’s now move to creating the vocabulary, which in this context means a list with all the unique words in our training set.

train['SMS'] = train['SMS'].str.split()

vocabulary = []

for sentence in train['SMS']:
    for word in sentence:
        vocabulary.append(word)

len(vocabulary)
72427
vocabulary = list(set(vocabulary))
len(vocabulary)
7783

It looks like there are 7,783 unique words in all the messages of our training set.

The Final Training Set

We’re now going to use the vocabulary we just created to make the data transformation we want.

# One column per vocabulary word, one row per message, all counts initialized to zero
word_counts_per_sms = {unique_word: [0] * len(train['SMS']) for unique_word in vocabulary}

# Tally each word's occurrences, message by message, in the order of train's rows
for index, sms in enumerate(train['SMS']):
    for word in sms:
        word_counts_per_sms[word][index] += 1
word_counts = pd.DataFrame(word_counts_per_sms)
word_counts.head()
0 00 000 000pes 008704050406 0089 01223585334 02 0207 02072069400 ... zindgi zoe zogtorius zouk zyada é ú1 ü 〨ud
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 2 0 0

5 rows × 7783 columns

train_word = pd.concat([train, word_counts], axis = 1)
train_word.head()
Label SMS 0 00 000 000pes 008704050406 0089 01223585334 02 ... zindgi zoe zogtorius zouk zyada é ú1 ü 〨ud
0 ham [go, until, jurong, point, crazy, available, o... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 ham [ok, lar, joking, wif, u, oni] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 NaN NaN 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 ham [u, dun, say, so, early, hor, u, c, already, t... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 ham [nah, i, don, t, think, he, goes, to, usf, he,... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0

5 rows × 7785 columns

The concatenation also produced rows with missing labels, which we filter out before moving on:

train_word = train_word[pd.notnull(train_word['Label'])]
train_word.head()
Label SMS 0 00 000 000pes 008704050406 0089 01223585334 02 ... zindgi zoe zogtorius zouk zyada é ú1 ü 〨ud
0 ham [go, until, jurong, point, crazy, available, o... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 ham [ok, lar, joking, wif, u, oni] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 ham [u, dun, say, so, early, hor, u, c, already, t... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 ham [nah, i, don, t, think, he, goes, to, usf, he,... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0
5 spam [freemsg, hey, there, darling, it, s, been, 3,... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 7785 columns
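
One caveat worth flagging: pd.concat aligns rows on the index, and because the data was shuffled, train's index no longer matches the default RangeIndex of word_counts. That mismatch is what produced the NaN rows above, and dropping them does not re-pair each label with the counts of its own message. A safer pattern (a minimal sketch, not the code run above) is to reset the shuffled index before concatenating:

# Sketch: align the two frames row-for-row before concatenating
train_reset = train.reset_index(drop = True)
word_counts_aligned = pd.DataFrame(word_counts_per_sms)   # RangeIndex matches train_reset
train_word_aligned = pd.concat([train_reset, word_counts_aligned], axis = 1)   # no NaN rows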

Calculating Constants First

We’re now done with cleaning the training set, and we can begin creating the spam filter. The Naive Bayes algorithm will need to answer these two probability questions to be able to classify new messages:

$$P(Spam \mid w_1, w_2, ..., w_n) \propto P(Spam) \cdot \prod_{i=1}^{n} P(w_i \mid Spam)$$

$$P(Ham \mid w_1, w_2, ..., w_n) \propto P(Ham) \cdot \prod_{i=1}^{n} P(w_i \mid Ham)$$

Also, to calculate $P(w_i|Spam)$ and $P(w_i|Ham)$ inside the formulas above, we'll need to use these equations:

$$P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$$

$$P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$$

Some of the terms in the four equations above will have the same value for every new message. We can calculate these values once and avoid repeating the computations each time a new message comes in. Below, we'll use our training set to calculate:

  • $P(Spam)$ and $P(Ham)$
  • $N_{Spam}$, $N_{Ham}$, $N_{Vocabulary}$

We’ll also use Laplace smoothing and set $\alpha = 1$.

# Isolating spam and ham messages first
spam_messages = train_word[train_word['Label'] == 'spam']
ham_messages = train_word[train_word['Label'] == 'ham']

# P(Spam) and P(Ham)
p_spam = len(spam_messages) / len(train_word)
p_ham = len(ham_messages) / len(train_word)

# N_Spam
n_words_per_spam_message = spam_messages['SMS'].apply(len)
n_spam = n_words_per_spam_message.sum()

# N_Ham
n_words_per_ham_message = ham_messages['SMS'].apply(len)
n_ham = n_words_per_ham_message.sum()

# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

Calculating Parameters

Now that we have the constant terms calculated above, we can move on to calculating the parameters $P(w_i|Spam)$ and $P(w_i|Ham)$. Each parameter is a conditional probability associated with one word in the vocabulary.

The parameters are calculated using the formulas:

$$P(w_i|Spam) = \frac{N_{w_i|Spam} + \alpha}{N_{Spam} + \alpha \cdot N_{Vocabulary}}$$

$$P(w_i|Ham) = \frac{N_{w_i|Ham} + \alpha}{N_{Ham} + \alpha \cdot N_{Vocabulary}}$$

ham_dict = {unique_words: 0 for unique_words in vocabulary}
spam_dict = {unique_words: 0 for unique_words in vocabulary}

for word in vocabulary:
    n_word_given_spam = spam_messages[word].sum()   # spam_messages already defined in a cell above
    p_word_given_spam = (n_word_given_spam + alpha) / (n_spam + alpha*n_vocabulary)
    spam_dict[word] = p_word_given_spam
    
    n_word_given_ham = ham_messages[word].sum()   # ham_messages already defined in a cell above
    p_word_given_ham = (n_word_given_ham + alpha) / (n_ham + alpha*n_vocabulary)
    ham_dict[word] = p_word_given_ham
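
As an aside, the per-word sums can also be computed in a single vectorized pass instead of a Python loop over the vocabulary. Here is a sketch using the same frames and constants defined above:

# Vectorized alternative: sum every word column once per class
spam_counts = spam_messages[vocabulary].sum(axis = 0)
ham_counts = ham_messages[vocabulary].sum(axis = 0)

# Apply Laplace smoothing to all words at once, then convert to dictionaries
spam_dict = ((spam_counts + alpha) / (n_spam + alpha * n_vocabulary)).to_dict()
ham_dict = ((ham_counts + alpha) / (n_ham + alpha * n_vocabulary)).to_dict()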

Classifying A New Message

Now that we have all our parameters calculated, we can start creating the spam filter. The spam filter can be understood as a function that:

  • Takes in as input a new message (w1, w2, …, wn).
  • Calculates P(Spam|w1, w2, …, wn) and P(Ham|w1, w2, …, wn).
  • Compares the values of P(Spam|w1, w2, …, wn) and P(Ham|w1, w2, …, wn), and:

    • If P(Ham|w1, w2, …, wn) > P(Spam|w1, w2, …, wn), then the message is classified as ham.
    • If P(Ham|w1, w2, …, wn) < P(Spam|w1, w2, …, wn), then the message is classified as spam.
    • If P(Ham|w1, w2, …, wn) = P(Spam|w1, w2, …, wn), then the algorithm may request human help.
import re

def classify(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham
    
    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]
            
        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]
            

    print('P(Spam|message):', p_spam_given_message)
    print('P(Ham|message):', p_ham_given_message)

    if p_ham_given_message > p_spam_given_message:
        print('Label: Ham')
    elif p_ham_given_message < p_spam_given_message:
        print('Label: Spam')
    else:
        print('Equal probabilities, have a human classify this!')
test_1 = 'WINNER!! This is the secret code to unlock the money: C3421.'
classify(test_1)
P(Spam|message): 4.9813712674139326e-29
P(Ham|message): 1.538857970697398e-25
Label: Ham
test_2 = "Sounds good, Tom, then see u there"
classify(test_2)
P(Spam|message): 2.9292764222612153e-24
P(Ham|message): 5.136710135587318e-22
Label: Ham
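
Note how small these probabilities already are after only a dozen words. For long messages the running product can underflow to 0.0 for both classes, making the comparison meaningless. A common remedy, not used in this project, is to compare sums of logarithms instead of products. A minimal sketch:

from math import log

def classify_log(message):
    # Same preprocessing as classify(), but accumulate log-probabilities
    words = re.sub(r'\W', ' ', message).lower().split()
    log_spam = log(p_spam)
    log_ham = log(p_ham)
    for word in words:
        if word in spam_dict:
            log_spam += log(spam_dict[word])
        if word in ham_dict:
            log_ham += log(ham_dict[word])
    # Tie handling omitted for brevity
    return 'ham' if log_ham > log_spam else 'spam'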

Measuring the Spam Filter’s Accuracy

The first test message above is obvious spam, yet the filter labeled it ham, which is not a great sign. Let's see how well the filter does on our full test set, which has 1,114 messages.

We’ll start by writing a function that returns classification labels instead of printing them.

def classify_test_set(message):

    message = re.sub(r'\W', ' ', message)
    message = message.lower()
    message = message.split()

    p_spam_given_message = p_spam
    p_ham_given_message = p_ham

    for word in message:
        if word in spam_dict:
            p_spam_given_message *= spam_dict[word]

        if word in ham_dict:
            p_ham_given_message *= ham_dict[word]

    if p_ham_given_message > p_spam_given_message:
        return 'ham'
    elif p_spam_given_message > p_ham_given_message:
        return 'spam'
    else:
        return 'needs human classification'

Now that we have a function that returns labels instead of printing them, we can use it to create a new column in our test set.

test['predicted'] = test['SMS'].apply(classify_test_set)
test.head()
Label SMS predicted
2131 ham Later i guess. I needa do mcat study too. ham
3418 ham But i haf enuff space got like 4 mb... ham
3424 spam Had your mobile 10 mths? Update to latest Oran... ham
1538 ham All sounds good. Fingers . Makes it difficult ... ham
5393 ham All done, all handed in. Don't know if mega sh... ham
correct = 0
total = len(test)

for row in test.iterrows():
    row = row[1]
    if row['Label'] == row['predicted']:
        correct += 1

print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct/total)
Correct: 965
Incorrect: 149
Accuracy: 0.8662477558348295
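
For reference, pandas can compute the same accuracy without an explicit loop, by comparing the two columns directly:

# Vectorized check: fraction of rows where the prediction matches the label
print('Accuracy:', (test['predicted'] == test['Label']).mean())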

The accuracy is close to 86.6%. Our spam filter looked at 1,114 messages that it hadn't seen in training and classified 965 of them correctly. Note, however, that the test set is 86.8% ham, so a filter that always predicts ham would score almost the same; the margin over that trivial baseline is slim.

Conclusion

In this project, we managed to build a spam filter for SMS messages using the multinomial Naive Bayes algorithm. The filter had an accuracy of 86.6% on the test set we used. Our initial goal was an accuracy of over 80%, and we managed to do a little better than that, although only slightly better than predicting ham for every message.

Next steps include:

  • Analyze the 149 messages that were classified incorrectly and try to figure out why the algorithm misclassified them
  • Verify the row alignment between labels and word counts in the concatenation step, since pd.concat aligns on the index
  • Make the filtering process more complex by making the algorithm sensitive to letter case