Brief lecture on Text Mining and Social Network Analysis with R, by Deolu Adeleye


Text Mining, Social Network Analysis

Deolu Adeleye

Text Mining

Just as we can mine raw materials from ores, we can also intelligently ‘mine’ textual data from larger collections of data. Once again, R proves to be a very powerful tool, with packages such as twitteR proving quite useful, as we’ll soon demonstrate.

As a demonstration, we’ll mine textual information from the popular social network Twitter. We’ll be examining tweets from the Twitter handle ‘@55wordsorless’ (though you could use any handle of your choice when running the code). Do note that these demonstrations will require an active internet connection (at least in the beginning, to authenticate), and will be using the following R packages (a quick install sketch follows the list):

• twitteR
• tm
• wordcloud
• SnowballC
• RWeka
• igraph
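If you don’t already have them, all six can be installed from CRAN in one go; a minimal sketch:

#one-time installation of the packages used in this lecture
install.packages(c("twitteR", "tm", "wordcloud", "SnowballC", "RWeka", "igraph"))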

The first step is to create a Twitter application for yourself. Go to https://twitter.com/apps/new and log in. After filling in the basic info, go to the “Settings” tab and select “Read, Write and Access direct messages”. Make sure to click on the save button after doing this. In the “Details” tab, take note of the following:

• your consumer key
• your consumer secret
• your access token
• your access secret

Once these four are retrieved, simply insert them into the setup_twitter_oauth function in the format setup_twitter_oauth("API key", "API secret", "Access token", "Access secret"). Here’s ours with the corresponding values inserted:

#load the twitteR package
library(twitteR)

#authenticate
setup_twitter_oauth(our_key, our_secret, our_token, our_access_secret)

## [1] "Using direct authentication"


You only need to authenticate once per R session.
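On a related note: rather than pasting the four values directly into your script, you could keep them in environment variables and read them in at runtime. A minimal sketch (the variable and environment-variable names here are our own, hypothetical choices):

#read credentials from environment variables (hypothetical names)
our_key           <- Sys.getenv("TWITTER_API_KEY")
our_secret        <- Sys.getenv("TWITTER_API_SECRET")
our_token         <- Sys.getenv("TWITTER_ACCESS_TOKEN")
our_access_secret <- Sys.getenv("TWITTER_ACCESS_SECRET")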

So, we’ve authenticated. Next, let’s just randomly mine a particular word, say ‘water’, from everywhere it was used recently on Twitter.

#retrieve last 50 tweets where hashtag '#water' is used, for example
watertag<-searchTwitter('#water', n=50)
head(watertag,3)

## [[1]]## [1] "FrozenMOVlE: #vsco #afterlight #winter #wisconsin #water #lake #michigan #frozen #milwaukee #city http://t.co/oIk004qj3O"#### [[2]]## [1] "FrozenMOVlE: Taking over my brothers place <ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082> #frozen #water #sandwich #breakingout #ugly http://t.co/stgameCgt2"#### [[3]]## [1] "vikashprasad21: RT @WaterNetwork1: #DSRSD #Certified For #Water #Quality #Testing http://t.co/bkF2mz47qL"

Next, let’s get info from the particular user ‘@55wordsorless’:

#retrieve the last 100 tweets from the specified timeline
tweets <- userTimeline('55wordsorless', n=100)
head(tweets,3)

## [[1]]## [1] "55WordsOrLess: @_missjem_ Someone already has...though how you could borrow Dr. Who's TARDIS is your next difficulty... :|"#### [[2]]## [1] "55WordsOrLess: Join the conversation!! http://t.co/xqDAtWbzVK :D :D"#### [[3]]## [1] "55WordsOrLess: @BeautifulFeet_ Did you read the 'Mischievous' Thoughts as well? :D"

For our purposes, we’ll convert these into a data.frame object:

watertag_df <- twListToDF(watertag)
tweets_df <- twListToDF(tweets)
head(watertag_df,3)

## text
## 1 #vsco #afterlight #winter #wisconsin #water #lake #michigan #frozen #milwaukee #city http://t.co/oIk004qj3O
## 2 Taking over my brothers place <ed><U+00A0><U+00BD><ed><U+00B8><U+0082><ed><U+00A0><U+00BD><ed><U+00B8><U+0082> #frozen #water #sandwich #breakingout #ugly http://t.co/stgameCgt2
## 3 RT @WaterNetwork1: #DSRSD #Certified For #Water #Quality #Testing http://t.co/bkF2mz47qL
## favorited favoriteCount replyToSN created truncated
## 1 FALSE 0 <NA> 2015-01-02 21:06:40 FALSE
## 2 FALSE 0 <NA> 2015-01-02 21:06:21 FALSE
## 3 FALSE 0 <NA> 2015-01-02 21:05:22 FALSE
## replyToSID id replyToUID
## 1 <NA> 551122427166875648 <NA>
## 2 <NA> 551122347814825984 <NA>
## 3 <NA> 551122099348054016 <NA>
## statusSource
## 1 <a href="http://ifttt.com" rel="nofollow">IFTTT</a>
## 2 <a href="http://ifttt.com" rel="nofollow">IFTTT</a>
## 3 <a href="http://spinabell.com" rel="nofollow">spinabell</a>
## screenName retweetCount isRetweet retweeted longitude latitude
## 1 FrozenMOVlE 0 FALSE FALSE <NA> <NA>
## 2 FrozenMOVlE 0 FALSE FALSE <NA> <NA>
## 3 vikashprasad21 1 TRUE FALSE <NA> <NA>

head(tweets_df,3)

## text
## 1 @_missjem_ Someone already has...though how you could borrow Dr. Who's TARDIS is your next difficulty... :|
## 2 Join the conversation!! http://t.co/xqDAtWbzVK :D :D
## 3 @BeautifulFeet_ Did you read the 'Mischievous' Thoughts as well? :D
## favorited favoriteCount replyToSN created truncated
## 1 FALSE 0 _MissJem_ 2014-11-28 18:03:43 FALSE
## 2 FALSE 0 <NA> 2014-10-28 14:19:27 FALSE
## 3 FALSE 0 BeautifulFeet_ 2014-10-09 19:48:46 FALSE
## replyToSID id replyToUID
## 1 538279840193839108 538392811075559424 434366153
## 2 <NA> 527102349727531008 <NA>
## 3 <NA> 520299852824334337 92370873
## statusSource
## 1 <a href="https://mobile.twitter.com" rel="nofollow">Mobile Web (M2)</a>
## 2 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>
## 3 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## screenName retweetCount isRetweet retweeted longitude latitude
## 1 55WordsOrLess 0 FALSE FALSE NA NA
## 2 55WordsOrLess 0 FALSE FALSE NA NA
## 3 55WordsOrLess 0 FALSE FALSE NA NA

After that, we’ll convert to a corpus (which is just a collection of text documents) using the tm package:

library(tm)
#build a corpus, and specify the source to be character vectors
watertag_corpus <- Corpus(VectorSource(watertag_df$text))
tweets_corpus <- Corpus(VectorSource(tweets_df$text))

The corpus allows us to perform certain manipulations with functions in the tm package. You should run ?Corpus to see other possible sources of textual data you can harness.
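For instance, your text needn’t come from Twitter at all. Here’s a sketch of building a corpus from a folder of plain-text files (the folder path 'texts/' is a hypothetical example):

#build a corpus from every file in a local directory
local_corpus <- Corpus(DirSource("texts/"))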

Let’s proceed by first ‘cleaning’ our data:

#make a copy, just in case we might need the original later
watertag_1 <- watertag_corpus
tweets_1 <- tweets_corpus

# remove punctuation
watertag_corpus <- tm_map(watertag_corpus, removePunctuation)
tweets_corpus <- tm_map(tweets_corpus, removePunctuation)
# remove numbers
watertag_corpus <- tm_map(watertag_corpus, removeNumbers)
tweets_corpus <- tm_map(tweets_corpus, removeNumbers)
# convert to lower case
watertag_corpus <- tm_map(watertag_corpus, content_transformer(tolower))
tweets_corpus <- tm_map(tweets_corpus, content_transformer(tolower))


# remove whitespace
watertag_corpus <- tm_map(watertag_corpus, stripWhitespace)
tweets_corpus <- tm_map(tweets_corpus, stripWhitespace)
# remove stopwords such as 'you', 'me', etc.
watertag_corpus <- tm_map(watertag_corpus, removeWords, stopwords("english"))
tweets_corpus <- tm_map(tweets_corpus, removeWords, stopwords("english"))

# remove URLs
# We'll create a function to look for 'http' in our text, and then delete the links
removeURL <- content_transformer(function(x) gsub("http[[:alnum:]]*", "", x))
watertag_corpus <- tm_map(watertag_corpus, removeURL)
tweets_corpus <- tm_map(tweets_corpus, removeURL)

#inspect our results
inspect(head(watertag_corpus,3))

## <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
##
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## vsco afterlight winter wisconsin water lake michigan frozen milwaukee city
##
## [[2]]
## <<PlainTextDocument (metadata: 7)>>
## taking brothers place <U+653C><U+3E64><U+613C><U+3E30><U+623C><U+3E64><U+653C><U+3E64><U+623C><U+3E38><U+383C><U+3E32><U+653C><U+3E64><U+613C><U+3E30><U+623C><U+3E64><U+653C><U+3E64><U+623C><U+3E38><U+383C><U+3E32> frozen water sandwich breakingout ugly
##
## [[3]]
## <<PlainTextDocument (metadata: 7)>>
## rt waternetwork dsrsd certified water quality testing

inspect(head(tweets_corpus,3))

## <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
##
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## missjem someone already hasthough borrow dr whos tardis next difficulty
##
## [[2]]
## <<PlainTextDocument (metadata: 7)>>
## join conversation d d
##
## [[3]]
## <<PlainTextDocument (metadata: 7)>>
## beautifulfeet read mischievous thoughts well d

Other transformations possible with tm_map can be obtained by running getTransformations().
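For reference, on the version of tm assumed here that call returns something like the following (your version may list more):

getTransformations()

## [1] "removeNumbers"     "removePunctuation" "removeWords"
## [4] "stemDocument"      "stripWhitespace"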

In many applications, words need to be stemmed to retrieve their radicals, so that various forms derived from a stem would be counted as the same word when tallying word frequency. Stemming uses an algorithm that removes common word endings for English words, such as “es”, “ed” and “’s”. For instance, the words “update”, “updated” and “updating” would all be stemmed to “updat”. It’s not mandatory (and sometimes it may be counter-productive), but it does pay to understand what it does, so we’ll demonstrate:


# create a copy we'll stem
watertag_stemmed <- watertag_corpus
tweets_stemmed <- tweets_corpus
# stem words
library(SnowballC)
watertag_stemmed <- tm_map(watertag_stemmed, stemDocument)
tweets_stemmed <- tm_map(tweets_stemmed, stemDocument)
# inspect our stemmed results
inspect(head(watertag_stemmed,3))

## <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
##
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## vsco afterlight winter wisconsin water lake michigan frozen milwauke citi
##
## [[2]]
## <<PlainTextDocument (metadata: 7)>>
## take brother place <U+653C><U+3E64><U+613C><U+3E30><U+623C><U+3E64><U+653C><U+3E64><U+623C><U+3E38><U+383C><U+3E32><U+653C><U+3E64><U+613C><U+3E30><U+623C><U+3E64><U+653C><U+3E64><U+623C><U+3E38><U+383C><U+3E32> frozen water sandwich breakingout ugli
##
## [[3]]
## <<PlainTextDocument (metadata: 7)>>
## rt waternetwork dsrsd certifi water qualiti test

inspect(head(tweets_stemmed,3))

## <<VCorpus (documents: 3, metadata (corpus/indexed): 0/0)>>
##
## [[1]]
## <<PlainTextDocument (metadata: 7)>>
## missjem someon alreadi hasthough borrow dr whos tardi next difficulti
##
## [[2]]
## <<PlainTextDocument (metadata: 7)>>
## join convers d d
##
## [[3]]
## <<PlainTextDocument (metadata: 7)>>
## beautifulfeet read mischiev thought well d
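You can also check the stemmer on individual words directly, using wordStem from SnowballC. Per the “update” example above, all three forms reduce to the same radical:

#stemming single words, rather than whole documents
wordStem(c("update", "updated", "updating"))

## [1] "updat" "updat" "updat"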

A term-document matrix represents the relationship between terms and documents, where each row stands for a term and each column for a document, and an entry is the number of occurrences of the term in the document.

In contrast, a document-term matrix is simply the transpose of the term-document matrix, with documents as rows, and columns as terms.

So which should you use? Whichever you prefer; apart from orientation, they carry exactly the same information.

#creating term-document matrices
watertag_tdm<-TermDocumentMatrix(watertag_corpus)
tweets_tdm<-TermDocumentMatrix(tweets_corpus)

#creating document-term matrices
watertag_dtm<-DocumentTermMatrix(watertag_corpus)
tweets_dtm<-DocumentTermMatrix(tweets_corpus)

# just to compare the two:
watertag_tdm

## <<TermDocumentMatrix (terms: 311, documents: 50)>>
## Non-/sparse entries: 492/15058
## Sparsity : 97%
## Maximal term length: 61
## Weighting : term frequency (tf)

watertag_dtm

## <<DocumentTermMatrix (documents: 50, terms: 311)>>
## Non-/sparse entries: 492/15058
## Sparsity : 97%
## Maximal term length: 61
## Weighting : term frequency (tf)

As seen above, apart from being each other’s transpose, they’re practically the same.
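If you’d like to verify that for yourself, here’s a one-line sanity check (a sketch; it should print TRUE):

#the dtm should contain exactly the values of the transposed tdm
all(as.matrix(watertag_dtm) == t(as.matrix(watertag_tdm)))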

With our matrix, we can perform quite a number of functions. For instance, if we wanted to know the frequency of occurrence of some words:

#find terms which occur 5 times or more
findFreqTerms(watertag_dtm, 5)

## [1] "amp" "ice" "sun" "water"

#how about 10 times or more?
findFreqTerms(tweets_tdm, lowfreq=10)

## [1] "man" "now"

It is important to note that the results are ordered alphabetically, not according to frequency of occurrence.

If we want it according to frequency, we’ll obtain it as a vector by converting into a matrix and using the rowSums function if we’re working with a tdm, and colSums if a dtm:

#remember you can use either dtm or tdm - we're using both interchangeably just to demonstrate
watertag_freq <- colSums(as.matrix(watertag_dtm))
tweets_freq <- rowSums(as.matrix(tweets_tdm))

…and then we sort it in descending order, so it shows the terms with maximum occurrence first:

#display head of most frequent terms
head(sort(watertag_freq,decreasing=TRUE))

## water ice amp sun frozen fun
## 57 7 5 5 4 4


head(sort(tweets_freq,decreasing=TRUE))

## man now just said home never
## 10 10 7 7 6 6

We could even see the frequency of frequencies, to know how many times some terms appear:

head(table(watertag_freq),15)

## watertag_freq
## 1 2 3 4 5 7 57
## 200 85 16 6 2 1 1

head(table(tweets_freq),15)

## tweets_freq
## 1 2 3 4 5 6 7 10
## 569 85 24 21 4 4 2 2

This tells us that from our search, 200 terms occur just once; and from our tweets, 569 terms occur just once, and so forth…

We could also retrieve associations between words: if two words always appeared together, their correlation would be 1.0; if they never appeared together, 0.0. Those are the boundaries. So, let’s say we wanted to see words that have at least a 0.5 correlation with the word ‘time’ in our search results:

findAssocs(watertag_dtm, "time", corlimit=0.5)

## $time
## numeric(0)

Note that a result of numeric(0) indicates no correlating words were found, meaning the word you searched for didn’t occur (to the level of correlation you specified). How about the words ‘trend’ and ‘food’ from our timeline, this time with a 0.4 correlation?

findAssocs(tweets_tdm, c("trend","food"), corlimit=0.4)

## $trend
## numeric(0)
##
## $food
## diner garbage” protested rat siryou tastes cook
## 1.00 1.00 1.00 1.00 1.00 1.00 0.70
## money paid yet stunned good like
## 0.70 0.70 0.70 0.57 0.49 0.49

What if we wanted to graphically represent our results? We could, and it only requires a few lines of code. For example: let’s make a barplot of all the terms that occur at least 5 times in our text source(s). (5 is considerably small, but serves this particular example well.)


#using ggplot2 package
library(ggplot2)
#from our search on Twitter
qplot(names(watertag_freq[watertag_freq>=5]), watertag_freq[watertag_freq>=5], geom="bar",
      stat="identity", xlab="Frequency", ylab="Terms", main="Search Results") + coord_flip()

Figure 1: Words Occurring At Least 5 Times

#from our timeline
qplot(names(tweets_freq[tweets_freq>=5]), tweets_freq[tweets_freq>=5], geom="bar",
      stat="identity", xlab="Frequency", ylab="Terms", main="@55wordsorless Timeline") + coord_flip()

Figure 2: Words Occurring At Least 5 Times


Wordclouds are also a very cool graphical representation of textual information. Here, the more frequently a word occurs, the bolder and larger it is displayed, with the reverse being true. By default the most frequent words have a font scale of 4 and the least have a scale of 0.5, but even that can be changed, as we’ll demonstrate!

tweets_freq<-sort(tweets_freq,decreasing=TRUE)
watertag_freq<-sort(watertag_freq,decreasing=TRUE)

#wordcloud package allows us to produce wordclouds
library(wordcloud)

#each time wordcloud is run, it randomly produces a layout.
#Though it doesn't really matter, you can set the seed to keep the layout the same
set.seed(77)
#'min.freq' specifies the minimum frequency of the words to be plotted
wordcloud(names(tweets_freq), tweets_freq, min.freq=3)

Figure 3: Wordcloud Using min.freq

#max.words specifies the maximum number of words it should plot
#scale changes font scale
wordcloud(names(tweets_freq), scale=c(5, .1), tweets_freq, max.words=100)

## Warning in wordcloud(names(tweets_freq), scale = c(5, 0.1), tweets_freq, :
## now could not be fit on page. It will not be plotted.


Figure 4: Wordcloud Using max.words


#just adding some colour!
set.seed(79)
wordcloud(names(watertag_freq), watertag_freq, min.freq=2,
          random.color=TRUE, colors=rainbow(7))

Figure 5: Wordcloud With Colour!

Run ?wordcloud for even more options you can specify.

Word clusters can also be generated.

Hierarchical

Let’s first remove some sparse terms that occur minimally and are not so important, using removeSparseTerms. The value of sparse sets the maximum allowed sparsity: terms absent from at least that fraction of documents are dropped, and the remaining (less sparse) terms are retained.

#we're using 0.95 because our text source has only a few terms, and not many re-occurring words
tweets_sparsed <- removeSparseTerms(tweets_tdm, sparse=0.95)
distMatrix <- dist(scale(tweets_sparsed))
fit <- hclust(distMatrix, method="ward")

## The "ward" method has been renamed to "ward.D"; note new "ward.D2"

plot(fit)
# cut tree into 10 clusters
rect.hclust(fit, k=10)

(groups <- cutree(fit, k=10))

## ’ll bed check died discovered good
## 1 2 3 1 3 4
## home just kill know last like
## 5 6 7 8 7 4
## love made man marry mischievous never
## 1 8 1 1 3 2
## new now one said smiled steel
## 4 9 4 1 4 3
## still swore take things took went
## 6 2 1 3 10 5
## wife
## 2

Figure 6: Cluster Dendrogram

K-means

We can also use k-means clustering in our analysis. However, for this you MUST use a document-term matrix, since kmeans treats rows as the observations to cluster, and here we want to cluster documents:

#using DOCUMENT-TERM matrix
(tweets_sparsed <- removeSparseTerms(tweets_dtm, sparse=0.95))

## <<DocumentTermMatrix (documents: 79, terms: 31)>>
## Non-/sparse entries: 153/2296
## Sparsity : 94%
## Maximal term length: 11
## Weighting : term frequency (tf)

#setting our value of k
k <- 4
kmeansResult <- kmeans(tweets_sparsed, k)
# cluster centers
round(kmeansResult$centers, digits=3)

## ’ll bed check died discovered good home just kill know last
## 1 0.034 0 0 0.051 0 0.051 0.051 0.102 0.068 0.068 0.051
## 2 0.000 0 1 0.000 1 0.000 0.000 0.000 0.000 0.000 0.000
## 3 0.167 0 0 0.083 0 0.083 0.000 0.083 0.000 0.083 0.083
## 4 0.000 1 0 0.000 0 0.000 0.750 0.000 0.000 0.000 0.000
## like love made man marry mischievous never new now one said
## 1 0.051 0.051 0.068 0.0 0.000 0.017 0.051 0.068 0.119 0.085 0.000
## 2 0.000 0.000 0.000 1.0 0.000 1.000 0.000 0.000 0.000 0.000 0.000
## 3 0.083 0.083 0.083 0.5 0.333 0.000 0.000 0.000 0.250 0.000 0.583
## 4 0.000 0.000 0.000 0.0 0.000 0.000 0.750 0.000 0.000 0.000 0.000
## smiled steel still swore take things took went wife
## 1 0.102 0 0.068 0.000 0.017 0 0.085 0.051 0.017
## 2 0.000 1 0.000 0.000 0.000 1 0.000 0.000 0.000
## 3 0.000 0 0.000 0.083 0.250 0 0.083 0.000 0.083
## 4 0.000 0 0.000 0.750 0.000 0 0.000 0.250 0.500

To make things easier, let’s just print the top three words in every cluster:

for (i in 1:k){
  cat(paste("cluster ", i, ": ", sep=""))
  s <- sort(kmeansResult$centers[i,], decreasing=T)
  cat(names(s)[1:3], "\n")
  # if you want to print the tweets of every cluster, run the next line
  # print(tweets[which(kmeansResult$cluster==i)])
}


## cluster 1: now just smiled
## cluster 2: check discovered man
## cluster 3: said man marry
## cluster 4: bed home never

Social Network Analysis

First, we want to produce a term-term matrix, which is basically just a network of terms based on their co-occurrence in tweets. It is the matrix product of the term-document matrix and the document-term matrix (its transpose). We produce the matrix product using the operator %*%.

#matrix product;
#using sparsed tweets because original tdm in our example had too many sparse terms
#transposing with 't' operator
tweets_sparsed <- removeSparseTerms(tweets_tdm, sparse=0.95)
termTerm <- as.matrix(tweets_sparsed) %*% as.matrix(t(tweets_sparsed))
#inspect few rows and columns
termTerm[1:10,1:10]

## Terms
## Terms ’ll bed check died discovered good home just kill know
## ’ll 4 0 0 0 0 0 0 0 1 0
## bed 0 4 0 0 0 0 3 0 0 0
## check 0 0 4 0 4 0 0 0 0 0
## died 0 0 0 4 0 0 0 1 0 1
## discovered 0 0 4 0 4 0 0 0 0 0
## good 0 0 0 0 0 4 0 1 0 0
## home 0 3 0 0 0 0 8 0 0 0
## just 0 0 0 1 0 1 0 7 0 0
## kill 1 0 0 0 0 0 0 0 4 0
## know 0 0 0 1 0 0 0 0 0 5

After this, we can use the igraph package to represent this network of terms graphically, in a visually-appealing way:

library(igraph)
# build a graph from the above matrix
g <- graph.adjacency(termTerm, weighted=T, mode="undirected")
# remove loops
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)

# setting seed to make the layout reproducible
set.seed(1001)
#call to plot network
layout1 <- layout.fruchterman.reingold(g)
plot(g, layout=layout1)

What if we wanted a different layout?

plot(g, layout=layout.kamada.kawai)


Figure 7: Network Of Terms


Figure 8: Different Layout


What if we wanted an interactive network plot? Easy!

tkplot(g, layout=layout1)

In fact, in our interactive graphs, we can change the layout immediately by selecting different options in the Layout tab.

But the above just produces a graph with a lot of connections. What if we wanted to see straightaway which terms were more important, and which connections were stronger? We can do that by specifying options with the following code:

#make stronger connections more bold on vertices 'V'
V(g)$label.cex <- 2.2 * V(g)$degree / max(V(g)$degree) + .2
#color
V(g)$label.color <- rgb(0, 0, .2, .8)
#no frame
V(g)$frame.color <- NA
egam <- (log(E(g)$weight)+.4) / max(log(E(g)$weight)+.4)
# access edges 'E'
E(g)$color <- rgb(.5, .5, 0, egam)
E(g)$width <- egam
# plot the graph in layout1
plot(g, layout=layout1)

…and straightaway we can see which words are more ‘weighted’, and even point out one or two clusters…

How about making this new graph interactive too? As before, just use tkplot:

tkplot(g, layout=layout1)

As usual, there are a plethora of options and settings at your disposal! Just run ?igraph::layout to see them! (We’re specifying the package because you might have another layout function from another package.)


Figure 9: Weighted Network
