Sentiment Analysis – NoSimpler

Twitter Sentiment Analysis

Twitter sentiment analysis determines, from tweets, whether people are talking positively or negatively about a topic. Each word in a tweet is checked against a list of words indicating positive sentiment and a list of words indicating negative sentiment, and the tweet is scored by how many matches it has in each list. If the positive score is higher, the tweet is considered to show positive sentiment; if the negative score is higher, negative.

[Figure: sa-hillary-donald-india]

[Figure: sa-tweets-apple-macbookpro]

[Figure: sa-donald-hillary-freq-words]

Programming Logic

Steps for Twitter Sentiment Analysis

Prerequisites:

Text files containing words that indicate positive or negative sentiment. In this example I'm using word lists prepared for sentiment analysis of the topic 'abortion', purely for illustration; make sure you use word lists that fit your own analysis topic.
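
As a minimal sketch of what such a word-list file can look like, the snippet below writes a tiny made-up lexicon to a temporary file and reads it back with the same scan() arguments used later in this post. The file contents and the names lexicon_file and toy_words are assumptions for illustration only; lines starting with ';' are treated as comments and skipped.

```r
# Hypothetical miniature word list; the real files hold thousands of words.
lexicon_file = tempfile(fileext = ".txt")
writeLines(c("; toy positive word list (illustration only)",
             ";",
             "good",
             "great",
             "love"),
           lexicon_file)

# comment.char=';' skips the header lines; what='character' reads words as strings
toy_words = scan(lexicon_file, what = 'character', comment.char = ';', quiet = TRUE)
print(toy_words)
```

The result is a plain character vector, which is the form the scoring function expects for its positive and negative word arguments.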

Step 1:
Install the required R packages and load them

Step 2:
Set up the environment options, if any

Step 3:
Connect to Twitter using the OAuth credentials

Step 4:
Read the desired number of tweets (maximum about 1,500) for the given topic or search terms

Step 5:
Convert the tweets into text

Step 6:
For each tweet, clean the text and split it into words

Step 7:
For each tweet, match each word against the positive/negative word lists and assign scores; the final score is the difference between the positive and negative scores

Step 8:
Build the frequency table of the tweet scores

Step 9:
Plot a histogram of the tweet scores

Step 10:
Analyse the results

OAuth with Twitter

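Save the OAuth credentials in a tab-delimited text file whose header row holds the column names consumer_key, consumer_secret, access_token and access_secret (the names the doOAuth function reads). Keep the file in the same directory as the R code.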
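
A minimal sketch of that file layout, using dummy values and a temporary file, might look like this. The placeholder strings and the names creds_file and creds are assumptions for illustration; the four column names are the ones the doOAuth function in this post reads.

```r
# Dummy values only; real keys come from your Twitter application settings.
creds_file = tempfile(fileext = ".txt")
writeLines(c("consumer_key\tconsumer_secret\taccess_token\taccess_secret",
             "KEY_PLACEHOLDER\tSECRET_PLACEHOLDER\tTOKEN_PLACEHOLDER\tTOKENSECRET_PLACEHOLDER"),
           creds_file)

# header=TRUE turns the first row into column names
creds = read.table(creds_file, header = TRUE, stringsAsFactors = FALSE)
print(names(creds))
print(creds$consumer_key)
```

Each credential is then available as a column, for example creds$consumer_key.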

Authorize your R application with Twitter via OAuth so that it can read tweets.

See the Jalayer Academy video for the OAuth setup, starting at 2:16:00.

Install required R packages

install.packages("twitteR", dependencies=TRUE)
install.packages("RCurl")
install.packages("bitops")
install.packages("base64enc")
install.packages("httpuv")
install.packages("tm")
install.packages("wordcloud")
install.packages("stringr")
# plyr provides laply(), used later to score the tweets
install.packages("plyr")

Load the R packages

libs = c("twitteR", "RCurl", "tm", "stringr", "wordcloud")
lapply(libs, require, character.only=TRUE)

Set Environment option

Do not convert character variables to factor variables

options(stringsAsFactors = FALSE)
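
To illustrate what this option controls, here is a small sketch that passes stringsAsFactors explicitly to data.frame() instead of relying on the global option. The sample tweets and variable names are made up for illustration. Since R 4.0.0 the default is already FALSE, so on recent versions the option mainly documents intent.

```r
tweets = c("love the new macbook", "battery life is terrible")

# With conversion on, the text column becomes a factor
df_factor = data.frame(tweet = tweets, stringsAsFactors = TRUE)
# With conversion off, it stays a character vector, ready for text cleaning
df_char = data.frame(tweet = tweets, stringsAsFactors = FALSE)

print(class(df_factor$tweet))
print(class(df_char$tweet))
```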

Function 1: doOAuth

Read OAuth credentials from file and connect to Twitter

Input - file path, file name

doOAuth = function(path, filename){

  # build the full path to the credentials file
  file = paste(path, filename, sep='/')
  # header=TRUE: the first row holds the column names
  oauthCreds = read.table(file, header=TRUE)
  # authenticate this R session with Twitter
  setup_twitter_oauth(oauthCreds$consumer_key,
                      oauthCreds$consumer_secret,
                      oauthCreds$access_token,
                      oauthCreds$access_secret)

}

Function 2: getTweets

Read recent tweets for search terms or topic
Input - search terms, number of tweets
Output - list containing tweets

getTweets = function(searchTerms, numberOfTweets){

  # search recent English-language tweets matching the search terms
  tweets_list = searchTwitter(searchTerms, lang="en", n=numberOfTweets, resultType="recent")
  # to inspect the result, use length(tweets_list) or class(tweets_list)
  return(tweets_list)

}

Function 3: getTweets_text

Retrieve the text part from tweets in list

Input - list containing tweets

Output - character vector

getTweets_text = function(tweets_list){

  tweets_text = sapply(tweets_list, function(x) x$getText())
  #str(tweets_text)
  #class(tweets_text)
  return(tweets_text)

}

Function 4: get_tweet_words

Clean up a tweet using R's regex global substitution function gsub() to remove punctuation, control characters and digits; convert it to lowercase and split it into words

Input - text of a tweet (character)

Output - character vector of the tweet's words

get_tweet_words = function(tweet_text){

  # remove punctuation, control characters and digits
  tweet_text_clean = gsub('[[:punct:]]','', tweet_text)
  tweet_text_clean = gsub('[[:cntrl:]]','', tweet_text_clean)
  tweet_text_clean = gsub('\\d+','', tweet_text_clean)
  tweet_text_clean = tolower(tweet_text_clean)
  # split on runs of whitespace; str_split() returns a list
  tweet_words_list = str_split(tweet_text_clean, '\\s+')

  # sometimes a list() is one level of hierarchy too much
  tweet_words = unlist(tweet_words_list)
  return(tweet_words)

}
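
To see what the cleaning steps do, here is a minimal sketch on one made-up tweet. It repeats the same gsub() and tolower() calls but swaps stringr's str_split() for base R's strsplit() so that it runs without extra packages; the sample tweet and the name clean_and_split are assumptions for illustration.

```r
clean_and_split = function(tweet_text){
  cleaned = gsub('[[:punct:]]', '', tweet_text)   # drop punctuation
  cleaned = gsub('[[:cntrl:]]', '', cleaned)      # drop control characters
  cleaned = gsub('\\d+', '', cleaned)             # drop digits
  cleaned = tolower(cleaned)
  # strsplit() returns a list with one element per input string
  unlist(strsplit(cleaned, '\\s+'))
}

sample_tweet = "RT @cnet: How Apple's MacBook Pro 2016 compares!"
words = clean_and_split(sample_tweet)
print(words)
# [1] "rt"       "cnet"     "how"      "apples"   "macbook"  "pro"      "compares"
```

Note that removing punctuation also removes apostrophes, so "Apple's" becomes "apples"; a word list would need to contain that form for it to match.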

Function 5: get_score_for_tweet

Match each tweet word against the positive and negative word lists with match(), which returns the word's position in the list if a match is found and NA otherwise

Count the non-NA matches for each list; the final score for a tweet is the number of positive matches minus the number of negative matches

get_score_for_tweet = function(tweet_words, pos.words, neg.words){

  pos.matches = match(tweet_words, pos.words)
  neg.matches = match(tweet_words, neg.words)
  #match() returns the position of the matched term or NA
  #we only need to know whether a match exists, so convert to TRUE/FALSE
  pos.matches = !is.na(pos.matches)
  neg.matches = !is.na(neg.matches)
  #sum() counts the TRUE values
  score = sum(pos.matches)-sum(neg.matches)
  return(score)

}
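
As a worked example of this scoring rule, the sketch below applies the same match() logic to three made-up tweets with toy word lists. The word lists, tweets and the name score_words are all assumptions for illustration, and vapply() is used here only as a base R way to apply the scorer to each tweet.

```r
score_words = function(tweet_words, pos.words, neg.words){
  # match() gives a position or NA; !is.na() turns that into TRUE/FALSE
  pos.hits = !is.na(match(tweet_words, pos.words))
  neg.hits = !is.na(match(tweet_words, neg.words))
  sum(pos.hits) - sum(neg.hits)
}

toy_pos = c("good", "great", "love")
toy_neg = c("bad", "terrible", "slow")

toy_tweets = list(
  c("love", "the", "great", "screen"),                          # 2 positive, 0 negative
  c("great", "screen", "but", "terrible", "slow", "battery"),   # 1 positive, 2 negative
  c("new", "macbook", "pro", "announced")                       # no matches
)

# apply the scorer to every tweet; each call returns a single integer
toy_scores = vapply(toy_tweets, score_words, integer(1), toy_pos, toy_neg)
print(toy_scores)
# [1]  2 -1  0
```

A word that appears twice in a tweet is counted twice, because the match is done per tweet word rather than per list entry.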

Frequency table for scores

Call the Twitter mining functions per the programming logic to assign a score to each tweet and create the frequency table of scores across all tweets

# oauth with twitter

doOAuth("<C:/...>","<twitter_Oauth_credentials.txt>")

# set seed so that result is same

set.seed(1234)

# read in positive words

pos = scan('<C:/~/positive-words.txt>',
           what='character',
           comment.char=';')
# Read 2006 items - 2006 words, class character
class(pos)
#[1] "character"
str(pos)
# chr [1:2006] "a+" "abound" "abounds" "abundance" "abundant" ...

# read in negative words

neg = scan('<C:/~/negative-words.txt>',
           what='character',
           comment.char=';')
# Read 4783 items - 4783 words, class character

# read in tweets for the search terms

searchTerms = c('apple macbook pro')
numberOfTweets = 600
tweets_list = getTweets(searchTerms, numberOfTweets)
length(tweets_list)
tweets_text = getTweets_text(tweets_list)

# clean tweets and split into words

tweets_words = lapply(tweets_text, get_tweet_words)
# returns a list of 600 items - each corresponding to one tweet
# each item is a character vector of the words in that tweet
class(tweets_words)
#[1] "list"
str(tweets_words)
# List of 600
# $ : chr [1:13] "see" "the" "detailed" "look" ...
# $ : chr [1:17] "rt" "testervnr" "although" "the" ...
# $ : chr [1:15] "samsung" "supplying" "oled" "panel" ...
# $ : chr [1:11] "rt" "cnet" "how" "apple" ...
# $ : chr [1:10] "razer" "mocks" "apple" "macbook" ...
# $ : chr [1:13] "apple" "<U+2033>""| __truncated__ "macbook" "pro" ...
# $ : chr [1:11] "rt" "cnet" "how" "apple" ...
# and so on ...

#now get the score for each
# input list 'l' , output array 'a', use laply

require(plyr)
scores = laply(tweets_words,get_score_for_tweet,pos,neg)
class(scores)
#[1] "integer"
str(scores)
#int [1:600] 0 0 0 0 -1 0 0 -1 1 0 ...

# create score frequency table

analysis = data.frame(score=scores, tweet=tweets_text)
class(analysis)

table(analysis$score)
#  -2  -1   0   1   2   3
#  88 139 279  78  14   2
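
To read the frequency table at a glance, the counts can be grouped into negative, neutral and positive shares. This is a minimal sketch that copies the counts from the table above into vectors (the variable names are assumptions for illustration) and also recovers the mean score from the table.

```r
# Counts copied from the frequency table above
score_values = c(-2, -1, 0, 1, 2, 3)
score_counts = c(88, 139, 279, 78, 14, 2)

total = sum(score_counts)
negative = sum(score_counts[score_values < 0])
neutral  = sum(score_counts[score_values == 0])
positive = sum(score_counts[score_values > 0])

# percentage of tweets in each group
shares = round(100 * c(negative = negative, neutral = neutral, positive = positive) / total, 1)
print(shares)
# negative  neutral positive
#     37.8     46.5     15.7

# mean score recovered from the table
mean_score = sum(score_values * score_counts) / total
print(round(mean_score, 4))
# [1] -0.3383
```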

Plot Histogram for Analysis

Start the sentiment analysis using the tweet scores: plot a histogram of the scores and overlay a normal distribution line on the histogram

# start analysis

attach(analysis)
summary(score)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# -2.0000 -1.0000  0.0000 -0.3383  0.0000  3.0000

# plot histogram

bins = seq(min(score), max(score), 1)
bins
# [1] -2 -1 0 1 2 3

h = hist(analysis$score, breaks=bins,
         main='sentiment analysis of tweets',
         ylab='frequency of scores',
         xlab='scores',
         col='grey')

# to plot normal distribution line on histogram

# 6 evenly spaced x values across the score range

xfit = seq(min(score), max(score), length=6)

# normal density at those points, using our data's mean and sd

yfit = dnorm(xfit, mean=mean(score), sd=sd(score))

# scale the density to counts: bin width times number of tweets

yfit = yfit*diff(h$mids[1:2])*length(score)

# plot the curve on the histogram

lines(xfit,yfit)
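
One detail worth checking: hist() uses right-closed bins by default, so with integer breaks the scores -2 and -1 land in the same first bar. The sketch below rebuilds the 600 scores from the published frequency table and compares integer breaks with half-integer breaks. The half-integer breaks are an assumption for this sketch, not the original code, and plot=FALSE is used only so the counts can be inspected without drawing.

```r
# Rebuild the 600 scores from the published frequency table
toy_scores = rep(c(-2, -1, 0, 1, 2, 3), times = c(88, 139, 279, 78, 14, 2))

# Integer breaks: bins are [-2,-1], (-1,0], (0,1], (1,2], (2,3]
h_int = hist(toy_scores, breaks = seq(min(toy_scores), max(toy_scores), 1), plot = FALSE)
print(h_int$counts)
# [1] 227 279  78  14   2

# Half-integer breaks: one bin centred on each integer score
h_half = hist(toy_scores,
              breaks = seq(min(toy_scores) - 0.5, max(toy_scores) + 0.5, 1),
              plot = FALSE)
print(h_half$counts)
# [1]  88 139 279  78  14   2

# Normal curve scaled to counts: density * bin width * number of tweets
xfit = seq(min(toy_scores), max(toy_scores), length.out = 100)
yfit = dnorm(xfit, mean = mean(toy_scores), sd = sd(toy_scores)) *
       diff(h_half$mids[1:2]) * length(toy_scores)
```

To draw the result, drop plot=FALSE from the hist() call and follow it with lines(xfit, yfit); using 100 points instead of 6 gives a smoother curve.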

Inferences

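From the bins we can infer that the inclination is more towards negative: tweets with a negative score (227) outnumber those with a positive score (94) by more than two to one, and the mean score is below zero.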

[Figure: sa-tweets-apple-macbookpro]