TURNING TWEETS INTO KNOWLEDGE

Sentiment analysis is the task of detecting people's emotions and opinions about a topic by analyzing the text they produce, such as tweets, Facebook comments and statuses, or YouTube comments. Retailers and companies in general use sentiment analysis to get an overview of their clients' opinions on their products, which lets them make improvements and modifications so that the products meet their clients' expectations. Social media sites such as Google Plus, Facebook, and Twitter allow people to express opinions, views, and emotions about a wide range of topics and events. Twitter data in particular is a rich source of information: it can be used to find trends related to a specific keyword, measure brand sentiment, or gather feedback about new products and services.

Many companies maintain an online presence to manage their public perception and to interact with their followers and fans. Apple is known for its laptops, cellphones, tablets, and personal media players, and it enjoys a large following of people who both "love" and "hate" the company. The objective of this project is to discover the public perception of Apple. The greatest challenge of this project was dealing with tweets, which are:

  • Loosely structured
  • Textual
  • Full of poor spelling and non-traditional grammar
  • Multilingual
  • Peppered with emojis

The tweets were collected using the Twitter API and labeled by humans with the help of Amazon Mechanical Turk. Five people contributed to the classification of thousands of tweets, choosing among the following options:

  • Strongly negative
  • Negative
  • Neutral
  • Positive
  • Strongly positive

For each tweet, the average of the five scores is used. The data consists of two variables:

  • Tweet: the text of the tweet
  • Avg: the average sentiment score for the tweet

Avg   Tweet
2     I have to say, Apple has by far the best customer care service I have ever received! @Apple @AppStore
2     iOS 7 is so fricking smooth & beautiful!! #ThanxApple @Apple
1.8   LOVE U @APPLE
1.8   Thank you @apple, loving my new iPhone 5S!!!!!  #apple #iphone5S pic.twitter.com/XmHJCU4pcb
1.8   .@apple has the best customer service. In and out with a new phone in under 10min!
1.8   @apple ear pods are AMAZING! Best sound from in-ear headphones I've ever had!

A Boolean dependent variable, Negative, is created: tweets with an average score of -1 or lower are labeled as negative (TRUE) and everything else as non-negative (FALSE).


# Read in the data
tweets = read.csv("tweets.csv", stringsAsFactors=FALSE)


# Create dependent variable
tweets$Negative = as.factor(tweets$Avg <= -1)
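
As a quick sanity check (output not shown here), the distribution of the new label can be inspected:

# Check how many tweets end up in each class
table(tweets$Negative)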

Cleaning Up Irregularities

Now that we have our labels, the next step is to build independent variables from the text of the tweets. Fully understanding the text of a tweet is difficult, so a simpler approach is to count the number of times each word appears. Consider the sentence "This show is great. I would recommend this show to my friends."

Word:   This  show  is  great  I  would  recommend  to  my  friends
Count:     2     2   1      1  1      1          1   1   1        1
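
For illustration (this is not part of the pipeline that follows), the same counts can be produced in base R:

# Count word occurrences in the example sentence (lower-cased, punctuation stripped)
sentence = "This show is great. I would recommend this show to my friends."
words = strsplit(tolower(gsub("[[:punct:]]", "", sentence)), "\\s+")[[1]]
table(words)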

Creating a feature for each word is a simple yet effective approach, and it is used as a baseline in most text analytics and natural language processing projects. Text data often has many inconsistencies that cause algorithms trouble, and preprocessing the data can dramatically improve performance. Computers are very literal by default: Apple, APPLE, and ApPLe will all be counted separately. The solution is to convert all the letters to lower case.

library(tm)
library(SnowballC)

# Create corpus
corpus = VCorpus(VectorSource(tweets$Tweet)) 

# Convert to lower-case
corpus = tm_map(corpus, content_transformer(tolower))
corpus[[1]]$content
[1] "i have to say, apple has by far the best customer care service i have ever received! @apple @appstore"

Punctuation also causes problems. The basic approach is to remove everything that isn't a letter, but this is not always appropriate; sometimes punctuation is meaningful, for instance:

  • Twitter: @apple is a message to Apple, #apple is about Apple
  • Web addresses: www.website.com/somepage.html

The removal strategy should therefore be tailored to the specific problem.
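
If handles and hashtags need to survive the cleaning step, one possible tailoring (purely illustrative; the keepHandles transformer below is not part of the pipeline used here) is to strip only the characters that are neither alphanumeric nor '@' or '#':

# Replace everything except letters, digits, '@' and '#' with a space (illustration only)
keepHandles = content_transformer(function(x) gsub("[^[:alnum:]@#]", " ", x))
# corpus = tm_map(corpus, keepHandles)   # would be used in place of removePunctuation below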

# Remove punctuation
corpus = tm_map(corpus, removePunctuation)

corpus[[1]]$content

[1] "i have to say apple has by far the best customer care service i have ever received apple appstore"

Removing Unhelpful Terms

Words such as the, is, at, and which are used frequently but carry meaning only within the structure of a sentence; they are called "stop words". They are unlikely to improve prediction quality, so they are removed to reduce the size of the data. The tm package ships with a list of English stop words. Let's have a look at the first ten:

# Look at stop words 
stopwords("english")[1:10]

[1] "i"         "me"        "my"        "myself"    "we"        "our"       "ours"      "ourselves" "you"       "your"

Removing these common words should not hurt the performance of the machine learning model. The word "apple" is also removed: since the tweets were collected by searching for Apple, it appears in virtually every tweet and carries no signal.

# Remove stopwords and apple
corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))

corpus[[1]]$content

[1] "   say    far  best customer care service   ever received! appstore"

Stemming

Stemming is the process of reducing inflected words to their stem, base, or root form. Consider the words:

  • argue
  • argued
  • argues
  • arguing

All of these words can be represented by the common stem argu. Stemming is an important part of the preprocessing pipeline in natural language processing.
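
For example, the wordStem() function from the SnowballC package (loaded earlier) applies the Porter stemmer directly to a character vector:

# Stem the example words with the Porter stemmer; all four forms reduce to "argu"
wordStem(c("argue", "argued", "argues", "arguing"), language = "porter")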

Errors in Stemming:

There are two main kinds of stemming errors:

  • Over-stemming: two words with different meanings are reduced to the same stem; this can be regarded as a false positive.
  • Under-stemming: two words that should be reduced to the same stem are not; this can be regarded as a false negative.
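
Both error types can be illustrated with the same wordStem() function (illustrative examples; exact behaviour may vary slightly between stemmer implementations):

# Over-stemming: semantically different words collapse to the same stem ("univers")
wordStem(c("universe", "university", "universal"), language = "porter")

# Under-stemming: related word forms keep different stems ("datum" vs. "data")
wordStem(c("datum", "data"), language = "porter")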

Applications of stemming:

Stemming is used in information retrieval systems such as search engines, and it is used to determine domain vocabularies in domain analysis. Fun fact: Google search adopted word stemming in 2003; previously, a search for "fish" would not have returned "fishing" or "fishes". There are various stemming algorithms, but the Porter Stemmer, developed by Martin Porter in 1980, is still widely used.

Porter’s Stemmer algorithm

It is based on the idea that the suffixes in the English language are made up of combinations of smaller and simpler suffixes. The stemmer is known for its speed and simplicity, and its main applications are in data mining and information retrieval. However, it applies only to English words, groups of word forms are mapped onto a single stem, and the output stem is not necessarily a meaningful word. Although its rule set is fairly lengthy, it is one of the oldest stemmers still in common use.

  • Advantage: it produces good output compared to other stemmers and has a relatively low error rate.
  • Limitation: the morphological variants it produces are not always real words.

# Stem document 

corpus = tm_map(corpus, stemDocument)

corpus[[1]]$content

[1] "say far best custom care servic ever receiv appstor"

The next step in the process is to create a document-term matrix (or its transpose, a term-document matrix), an important representation for text analytics. In a document-term matrix, each row is a document vector, with one column for every term in the entire corpus, and the value in each cell is the term frequency. This value is often a weighted frequency instead, typically tf-idf (term frequency times inverse document frequency).
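
For reference, tm can also build a tf-idf weighted matrix through the control argument of DocumentTermMatrix(); the analysis below sticks to plain term frequencies.

# Optional tf-idf weighted matrix (not used further in this analysis)
# frequenciesTfIdf = DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))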

# Create matrix

frequencies = DocumentTermMatrix(corpus)

frequencies

<<DocumentTermMatrix (documents: 1181, terms: 4166)>>
Non-/sparse entries: 9110/4910936
Sparsity           : 100%
Maximal term length: 123
Weighting          : term frequency (tf)

Let's look at a small slice of the matrix:

inspect(frequencies[1000:1005, 505:515])

          Terms
Document   cheapen cheeper check cheep cheer cheerio cherylcol chief chiiiqu child
    1000         0       0     0     0     0       0         0     0       0     0
    1001         0       0     0     0     0       0         0     0       0     0
    1002         0       0     0     0     0       0         0     0       0     0
    1003         0       0     0     0     0       0         0     0       0     0
    1004         0       0     0     0     0       0         0     0       0     0
    1005         0       0     0     0     0       0         0     0       0     0

Let's look at the terms that appear at least 20 times in the document-term matrix.

findFreqTerms(frequencies, lowfreq=20)

[1] "android"              "anyon"                "app"                  "appl"                 "back"                 "batteri"             
[7] "better"               "buy"                  "can"                  "cant"                 "come"                 "dont"                
[13] "fingerprint"          "freak"                "get"                  "googl"                "ios7"                 "ipad"                
[19] "iphon"                "iphone5"              "iphone5c"             "ipod"                 "ipodplayerpromo"      "itun"                
[25] "just"                 "like"                 "lol"                  "look"                 "love"                 "make"                
[31] "market"               "microsoft"            "need"                 "new"                  "now"                  "one"                 
[37] "phone"                "pleas"                "promo"                "promoipodplayerpromo" "realli"               "releas"              
[43] "samsung"              "say"                  "store"                "thank"                "think"                "time"                
[49] "twitter"              "updat"                "use"                  "via"                  "want"                 "well"                
[55] "will"                 "work"                                  

The next step is to remove the sparse terms. The threshold of 0.995 passed to removeSparseTerms() keeps only the terms that appear in roughly 0.5% or more of the tweets, which drastically reduces the number of features.

sparse = removeSparseTerms(frequencies, 0.995)
sparse

<<DocumentTermMatrix (documents: 1181, terms: 266)>>
Non-/sparse entries: 3782/310364
Sparsity           : 99%
Maximal term length: 22
Weighting          : term frequency (tf)

We can now convert the matrix into a data frame named tweetsSparse, make all the variable names R-friendly with make.names(), and then add the dependent variable.

tweetsSparse = as.data.frame(as.matrix(sparse))

colnames(tweetsSparse) = make.names(colnames(tweetsSparse))

tweetsSparse$Negative = tweets$Negative

We then split the data into training and test sets, putting 70% of the tweets in the training set.

# Split the data
library(caTools)
set.seed(123)

split = sample.split(tweetsSparse$Negative, SplitRatio = 0.7)

trainSparse = subset(tweetsSparse, split==TRUE)
testSparse = subset(tweetsSparse, split==FALSE)

Now we can train a CART (classification and regression tree) model using the rpart package and plot it with the rpart.plot package.

# Build a CART model
library(rpart)
library(rpart.plot)

tweetCART = rpart(Negative ~ ., data=trainSparse, method="class")

prp(tweetCART)

# Evaluate the performance of the model
predictCART = predict(tweetCART, newdata=testSparse, type="class")

table(testSparse$Negative, predictCART)
       predictCART
        FALSE TRUE
  FALSE   294    6
  TRUE     37   18
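
The accuracy follows directly from the confusion matrix as the fraction of correct predictions:

# Accuracy of the CART model: correct predictions divided by the number of test tweets
(294 + 18) / (294 + 6 + 37 + 18)

[1] 0.8788732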

The decision tree is roughly 87.9% accurate on the test set. We can build a random forest model to see whether it improves the classification accuracy.

# Random forest model
library(randomForest)
set.seed(123)

tweetRF = randomForest(Negative ~ ., data=trainSparse)

# Make predictions:
predictRF = predict(tweetRF, newdata=testSparse)

table(testSparse$Negative, predictRF)
       predictRF
        FALSE TRUE
  FALSE   293    7
  TRUE     34   21

The random forest's accuracy on the test data is roughly 88.5% ((293 + 21) / 355 ≈ 0.885), which is only a small improvement over the CART model. The accuracy could potentially be improved further by tuning the hyperparameters.
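
As a sketch of what such tuning might look like (this uses the caret package, which is an assumption here and not part of the original analysis), the mtry parameter of the random forest could be chosen by cross-validation:

# Hypothetical tuning sketch: 5-fold CV over a few candidate values of mtry (assumes caret is installed)
library(caret)
set.seed(123)

rfTuned = train(Negative ~ ., data = trainSparse, method = "rf",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = expand.grid(mtry = c(5, 15, 30, 60)))

rfTuned$bestTune   # best value of mtry found by cross-validation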

Amol Kulkarni
Ph.D.

My research interests include application of Machine learning algorithms to the fields of Marketing and Supply Chain Engineering, Decision Theory and Process Optimization.