twitter mining
![Page 1: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/1.jpg)
Microblog (Twitter) mining
yutao
![Page 2: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/2.jpg)
What is Twitter?
• 140-character tweets
• Hashtag: # before relevant keywords in a tweet
• RT means to “re-tweet”, or forward, a tweet
• @ reference refers to a user’s screen name
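The conventions listed on this slide can be pulled out of raw tweet text with a few regular expressions. This is a minimal sketch: the patterns and the `parse_tweet` helper are illustrative assumptions, not an official Twitter parser, and real tweets need more careful tokenization.

```python
import re

# Illustrative patterns (assumptions); real tweets have more edge cases.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
RETWEET = re.compile(r"^RT\s+@(\w+)")

def parse_tweet(text):
    """Extract hashtags, @ references and the RT marker from one tweet."""
    rt = RETWEET.match(text)
    return {
        "hashtags": HASHTAG.findall(text),
        "mentions": MENTION.findall(text),
        "retweet_of": rt.group(1) if rt else None,
        "within_limit": len(text) <= 140,  # the 140-character limit
    }

info = parse_tweet("RT @alice: slides on #twitter #mining are up")
```

Here `info["retweet_of"]` holds the screen name of the user being re-tweeted, or `None` for an original tweet.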
![Page 3: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/3.jpg)
Why is it different?
• Very short in length
• Written in an informal style
• Social
![Page 4: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/4.jpg)
What is Twitter, a social network or a news media? (WWW 2010)
• Following is mostly not reciprocated (not so “social”)
• Users talk about timely topics
• A few users reach a large audience directly
• Most users can reach a large audience quickly by word of mouth
![Page 5: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/5.jpg)
Early analysis
![Page 6: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/6.jpg)
Analysis 1: Take the people out
• Krishnamurthy et al. (2008)
• Users were classified by follower/following counts (numbers and ratios)
• Means and mechanisms of their engagement: web (61.7%), mobile/text (7.5%), software (22.4%)
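Grouping users by their follower/following ratio can be sketched as below. The thresholds (`hi`, `lo`) and the group names are hypothetical choices for illustration, not values taken from the study cited above.

```python
def classify_user(followers, following, hi=2.0, lo=0.5):
    """Group a user by follower/following ratio.

    hi and lo are assumed cut-offs, chosen only for illustration.
    """
    ratio = followers / max(following, 1)  # avoid division by zero
    if ratio >= hi:
        return "broadcaster"    # many followers, follows few
    if ratio <= lo:
        return "follows-many"   # follows many, few followers
    return "reciprocal"         # roughly balanced counts

groups = [classify_user(10000, 50), classify_user(20, 2000), classify_user(150, 160)]
```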
![Page 7: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/7.jpg)
Analysis 2: Content Category
Four meta-categories
• Daily chatter
• Conversations
• Information / URL sharing
• News reporting
![Page 8: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/8.jpg)
Analysis 3: measuring user influence
• Indegree, retweets and mentions
• Strong correlation between retweets and mentions
• Most connected != most influential
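One way to compare the three influence measures is a rank correlation between the per-user counts. The sketch below uses Spearman's coefficient in its simple no-ties form; the counts are made-up numbers chosen only to show the pattern described on this slide.

```python
def ranks(values):
    """Rank 1 = largest value. Assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation (no-ties formula)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical per-user counts for five users.
followers = [900, 50, 400, 10, 120]   # indegree
retweets  = [30, 80, 25, 5, 60]
mentions  = [20, 90, 30, 2, 70]

rt_vs_mentions = spearman(retweets, mentions)
indegree_vs_rt = spearman(followers, retweets)
```

In this toy data the retweet and mention rankings nearly agree, while the follower ranking says little about who gets retweeted.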
![Page 9: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/9.jpg)
User influence
![Page 10: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/10.jpg)
![Page 11: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/11.jpg)
![Page 12: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/12.jpg)
![Page 13: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/13.jpg)
How to detect spam?
• Classification
• Content attributes: hashtags, trending topics, replies, mentions, http links
• User behavior attributes: age of user account
• Graph-based attributes
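The attribute groups above can be turned into a feature vector for a classifier. This is a sketch under assumptions: the feature names, the input fields, and the follower/followee ratio as the graph-based attribute are illustrative choices, not the feature set of any specific paper.

```python
import re
from datetime import date

def spam_features(tweets, account_created, today, followers, followees):
    """Build per-user features from a non-empty list of tweet texts."""
    n = len(tweets)

    def frac(pattern):
        # Fraction of tweets containing the pattern.
        return sum(1 for t in tweets if re.search(pattern, t)) / n

    return {
        # Content attributes
        "frac_with_hashtag": frac(r"#\w+"),
        "frac_with_mention": frac(r"@\w+"),
        "frac_with_link": frac(r"https?://"),
        "frac_replies": sum(1 for t in tweets if t.startswith("@")) / n,
        # User behavior attribute
        "account_age_days": (today - account_created).days,
        # Graph-based attribute (assumed: follower/followee ratio)
        "follower_followee_ratio": followers / max(followees, 1),
    }

feats = spam_features(
    ["win free iphone http://spam.example #free",
     "@bob click http://spam.example",
     "hello world",
     "#free stuff"],
    account_created=date(2011, 1, 1), today=date(2011, 1, 31),
    followers=5, followees=500,
)
```

The resulting dictionary would then be fed to any supervised classifier together with spam / non-spam labels.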
![Page 14: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/14.jpg)
Sentiment analysis
• Supervised classification
• Training data come from Twitter itself, instead of being human-labeled
• Happy emoticons: “:-)”, “:)”, “=)”, “:D” etc.
• Sad emoticons: “:-(”, “:(”, “=(”, “;(” etc.
• Objective: newspapers and magazines such as “NY Times”
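The idea of using emoticons as noisy labels can be sketched with a tiny Naive Bayes classifier. Everything here is a simplified assumption: whitespace tokenization, add-one smoothing, and only the two emoticon classes (an objective class built from news accounts would be added the same way).

```python
import math
from collections import Counter, defaultdict

HAPPY = (":-)", ":)", "=)", ":D")
SAD = (":-(", ":(", "=(", ";(")

def emoticon_label(text):
    """Return 'pos', 'neg', or None (no emoticon, or mixed emoticons)."""
    has_happy = any(e in text for e in HAPPY)
    has_sad = any(e in text for e in SAD)
    if has_happy == has_sad:
        return None
    return "pos" if has_happy else "neg"

def tokens(text):
    """Remove the emoticons (they are the labels) and split on whitespace."""
    for e in HAPPY + SAD:
        text = text.replace(e, " ")
    return text.lower().split()

def train(tweets):
    word_counts, class_counts = defaultdict(Counter), Counter()
    for text in tweets:
        label = emoticon_label(text)
        if label is not None:
            class_counts[label] += 1
            word_counts[label].update(tokens(text))
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Multinomial Naive Bayes with add-one smoothing."""
    vocab = set().union(*word_counts.values())
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total)
        for w in tokens(text):
            score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train([
    "great day at the beach :)", "love this song :D",
    "missed my train :(", "so tired and sad =(",
    "just landed in boston",   # no emoticon: not used for training
])
```

With this toy model, `classify("love the beach", *model)` picks the positive class and `classify("so tired", *model)` the negative one.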
![Page 15: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/15.jpg)
Trend detection
• Bursty keyword detection
• Bursty keyword grouping
• Context extraction (such as PCA, SVD)
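The first step, bursty keyword detection, can be approximated by flagging words whose count in the current time window is far above their historical average. This is a simple baseline under assumed parameters (`k`, `min_count`), not the detection method of any particular trend-detection system.

```python
from collections import Counter
from statistics import mean, pstdev

def bursty_keywords(history, current, k=2.0, min_count=3):
    """history: list of Counters, one per past window; current: Counter.

    A word is bursty if its current count exceeds mean + k * std of its
    past counts and it occurs at least min_count times (assumed rule).
    """
    bursty = []
    for word, count in current.items():
        past = [window[word] for window in history]  # missing words count as 0
        threshold = mean(past) + k * pstdev(past)
        if count >= min_count and count > threshold:
            bursty.append(word)
    return sorted(bursty)

history = [Counter({"lunch": 5}), Counter({"lunch": 6}),
           Counter({"lunch": 4, "earthquake": 1})]
current = Counter({"lunch": 5, "earthquake": 40, "tsunami": 12, "rare": 1})
found = bursty_keywords(history, current)
```

Grouping would then cluster the flagged words that co-occur in the same tweets, for example "earthquake" and "tsunami" here.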
![Page 16: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/16.jpg)
Twitter search (WSDM 2011)
![Page 17: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/17.jpg)
The largest difference
• Twitter search orders results by time
• Search engines order results by relevance
• Social
• Time
![Page 18: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/18.jpg)
Recommendation
![Page 19: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/19.jpg)
Recommending content from information streams
• The filtering problem:
– “I get 1000+ items in my stream daily but only have time to read 10 of them. Which ones should I read?”
• The discovery problem:
– “There are millions of URLs posted daily on Twitter. Am I missing something important there outside my own Twitter stream?”
![Page 20: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/20.jpg)
Recommending content from information streams
• Recency of content: only interesting within a short time after being published
– Always a “cold start” situation
• Explicit interaction among users
– Users explicitly interact by subscribing or sharing
• User-generated content
– People are content producers as well as consumers
![Page 21: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/21.jpg)
Recommending content from information streams
![Page 22: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/22.jpg)
URL Sources
• Considering all URLs was impossible
• FoF: URLs from followees-of-followees
• Popular: URLs that are popular across the whole of Twitter
![Page 23: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/23.jpg)
Topic relevance scores
• Topic profile of URLs
– Use term vectors as profiles
– Built from tweets that have mentioned the URL
• Topic profile of users
– Self-topic: content profile based on what I post
– Followee-topic: content profile based on what my followees post
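A topic relevance score can be sketched as the cosine similarity between a user's term vector and a URL's term vector. The sketch assumes raw term counts and whitespace tokenization; the original system may weight terms differently (for example with TF-IDF).

```python
import math
from collections import Counter

def term_vector(texts):
    """Bag-of-words profile built from a list of tweet texts."""
    return Counter(w for t in texts for w in t.lower().split())

def cosine(a, b):
    """Cosine similarity between two term vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Self-topic profile: built from what the user posts (toy data).
self_profile = term_vector(["python data mining", "mining twitter data"])

# URL profiles: built from tweets that mention each URL (toy data).
url_a = term_vector(["twitter data mining tutorial"])
url_b = term_vector(["best pizza in town"])

score_a = cosine(self_profile, url_a)
score_b = cosine(self_profile, url_b)
```

A followee-topic profile would be built the same way, using the followees' tweets in place of the user's own.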
![Page 24: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/24.jpg)
Social network scores
• “Popular vote” among my followees-of-followees
– People “vote” for a URL by tweeting it
– Votes are weighted using social network structure
– URLs with more votes in total are assigned a higher score
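The voting scheme can be sketched as follows. The slide does not say how votes are weighted, so the weight used here (the number of my followees who follow the voter) is a hypothetical stand-in for "social network structure".

```python
from collections import defaultdict

def vote_scores(my_followees, follows, posted_urls):
    """follows: user -> set of users they follow.
    posted_urls: user -> set of URLs they tweeted."""
    # Collect the followees-of-followees neighborhood.
    fof = set()
    for f in my_followees:
        fof |= follows.get(f, set())

    scores = defaultdict(float)
    for voter in fof:
        # Assumed weight: how many of my followees follow this voter.
        weight = sum(1 for f in my_followees if voter in follows.get(f, set()))
        for url in posted_urls.get(voter, set()):
            scores[url] += weight  # each tweet of a URL is one weighted vote
    return dict(scores)

scores = vote_scores(
    my_followees={"a", "b"},
    follows={"a": {"x", "y"}, "b": {"x"}},
    posted_urls={"x": {"u1"}, "y": {"u1", "u2"}},
)
```

URLs are then ranked by total score, so `u1` (voted for by both `x` and `y`) comes before `u2`.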
![Page 25: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/25.jpg)
![Page 26: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/26.jpg)
Recommending twitter users to follow
• Social graph
• Profile user
– User himself
– Followers
– Followees
![Page 27: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/27.jpg)
Microblog summarization
![Page 28: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/28.jpg)
The phrase reinforcement algorithm
• Looking for the most commonly occurring phrases
– Users tend to use similar words when describing a particular topic
– RT
![Page 29: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/29.jpg)
![Page 30: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/30.jpg)
Hybrid TF-IDF summarization
• TF: the document is the entire collection of posts
• IDF: the document is a single post
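The two definitions above can be combined into a sentence-scoring sketch: term frequency is counted over all posts together, inverse document frequency treats each post as a document, and the best-scoring post is returned as the summary. The `log2` form and the length normalization with a minimum threshold (`min_len`) are assumptions about the details, which the slide does not give.

```python
import math
from collections import Counter

def hybrid_tfidf_summary(posts, min_len=5):
    """Return the post with the highest hybrid TF-IDF score."""
    tokenized = [p.lower().split() for p in posts]
    all_words = [w for toks in tokenized for w in toks]
    tf = Counter(all_words)                                   # TF: whole collection
    df = Counter(w for toks in tokenized for w in set(toks))  # IDF: per post
    n_posts, n_words = len(posts), len(all_words)

    def weight(w):
        return (tf[w] / n_words) * math.log2(n_posts / df[w])

    def score(toks):
        # Normalize by length so long posts are not favored (assumed rule).
        return sum(weight(w) for w in toks) / max(min_len, len(toks))

    best = max(range(len(posts)), key=lambda i: score(tokenized[i]))
    return posts[best]

posts = [
    "goal by messi in the final",
    "messi scores the winning goal in the final",
    "what a match",
    "eating pizza now",
]
summary = hybrid_tfidf_summary(posts)
```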
![Page 31: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/31.jpg)
Topic model
![Page 32: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/32.jpg)
Content modeling on Twitter
• Surface word features: tf.idf cosine similarity, etc.
• Deeper natural language processing: parsing, parts of speech, coreference, etc.
• Example tweet: “dats yur mom not me lol” (THE_REAL_SHAQ)
![Page 33: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/33.jpg)
Content modeling on Twitter
• Surface word features: tf.idf cosine similarity, etc.
• Topic models, dimensionality reduction: Latent Dirichlet Allocation (LDA), LSA, etc.
• Supervised classification: Naïve Bayes, SVM, etc.
• Labels: #hashtags, emoticons, questions, etc.
• Best model in ranking experiments: Labeled LDA
![Page 34: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/34.jpg)
Content modeling with Labeled LDA
• Discover unlabeled topics: parameter K=200 latent topic dimensions
• Model common labels: 500 - 1000 dimensions for hashtags, emoticons, etc.
Example topics (top words; label in brackets where the topic is labeled):
– obama president american america says country russia pope island
– I’m going go out gonna see im tonight sleep tomorrow about am night
– [Smile :)] :) good day morning thanks have happy hope birthday
– [Smile :)] :) can’t wait see one yay!!! cant tomorrow got !! next christmas
– [#jobs] #jobs featured manager sales engineer yahoo location senior
![Page 35: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/35.jpg)
Content modeling with Labeled LDA
Example posts:
– new muppetblog political commentary link
– @kermit heyy wanna catch a movie
– just ate a cookie #yummy
[Figure: each word in the posts is assigned a topic or label (4 1 1 1 / 2 2 2 3 3 / 5 5 #yummy #yummy); the histogram of these assignments serves as the signature for a set of posts.]
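The "histogram as signature" idea can be sketched by counting the per-word topic or label assignments over a set of posts and normalizing. The assignments below are copied from this slide; the `signature` helper and the normalization to proportions are assumptions.

```python
from collections import Counter

def signature(assignments_per_post):
    """Normalized histogram of topic/label assignments over a set of posts."""
    counts = Counter(label for post in assignments_per_post for label in post)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# One list per post: the topic id or label assigned to each word.
assignments = [
    [4, 1, 1, 1],
    [2, 2, 2, 3, 3],
    [5, 5, "#yummy", "#yummy"],
]
sig = signature(assignments)
```

Two users, or a user and a candidate post, can then be compared by comparing their signatures.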
![Page 36: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/36.jpg)
Twitter content by category
• Substance: 27%
• Status: 12%
• Style: 38%
• Social: 23%
Example topics (top words):
– can make help if someone tell_me them anyone use makes any sense trying explain
– obama president american america says country russia pope island failed honduras
– haha lol :) funny :p omg hahaha yeah too yes thats ha wow cool lmao though kinda
– am still doing sleep so going tired bed awake supposed hell asleep early sleeping sleepy
– night sleep bed going off tomorrow bye tonight goodnight all im time now nite
– iphone new phone app mobile apple ipod blackberry touch pro store apps free android an
– up what's hit pick whats hey set twitter sign give catch when show first wats make
– im get dont gonna shit gotta wanna cuz damn ur make cant say cause bout ill mad tired
![Page 37: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/37.jpg)
Characterizing Microblogs with Topic Models
Outline
• Modeling Twitter content with topic models
• Characterizing, recommending and filtering
![Page 38: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/38.jpg)
Characterizing users
![Page 39: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/39.jpg)
Characterizing users
![Page 40: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/40.jpg)
TwitterRank: Finding Topic-sensitive Influential Twitterers
• Apply LDA to distill topics automatically
• Find topics in the twitterer’s content to represent her interests
– Twitterer’s content = aggregated tweets
• Twitterers with “following” relationships are more similar than those without, according to the topics they are interested in
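A topic-sensitive influence score in the spirit of TwitterRank can be sketched as a PageRank-style walk for one topic. The transition weights below (followee's tweet share times topical similarity) and the teleport vector (normalized topic interest) are assumptions made for illustration; the slides do not give the exact definitions. `topic_interest` stands for each user's LDA weight on the chosen topic.

```python
def topic_rank(follows, tweet_counts, topic_interest, gamma=0.85, iters=50):
    """follows: user -> list of users they follow (influence flows to followees).
    tweet_counts: user -> number of tweets.
    topic_interest: user -> weight of the topic in the user's content (0..1)."""
    users = sorted(tweet_counts)
    total = sum(topic_interest.values())
    teleport = {u: topic_interest[u] / total for u in users}
    rank = dict(teleport)
    for _ in range(iters):
        new = {u: (1 - gamma) * teleport[u] for u in users}
        for i in users:
            friends = follows.get(i, ())
            denom = sum(tweet_counts[f] for f in friends)
            if not denom:
                continue
            for j in friends:
                # Assumed similarity: closeness of topic interest.
                sim = 1 - abs(topic_interest[i] - topic_interest[j])
                new[j] += gamma * rank[i] * (tweet_counts[j] / denom) * sim
        rank = new
    return rank

rank = topic_rank(
    follows={"a": ["c"], "b": ["c"], "c": ["a"]},
    tweet_counts={"a": 10, "b": 5, "c": 100},
    topic_interest={"a": 0.5, "b": 0.4, "c": 0.9},
)
most_influential = max(rank, key=rank.get)
```

Running the walk once per topic gives a separate ranking for each topic, which is what makes the score topic-sensitive.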
![Page 41: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/41.jpg)
Topic-specific TwitterRank
![Page 42: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/42.jpg)
Interesting applications
• Personalized and automatic social summarization of events in video
• Twitter can predict the stock market
• Predicting elections with Twitter
• Earthquake (time, location)
![Page 43: Twitter mining](https://reader035.vdocuments.us/reader035/viewer/2022062307/554e8701b4c90526358b4738/html5/thumbnails/43.jpg)
Thanks
Many pictures and slides come from the internet.