The Challenges in Analyzing Twitter Data for Public Opinion Researchers
Masahiko Aida, Director of Analytics
WHY TWITTER?
© 2012 All Rights Reserved. 1
• Access to Twitter data is open, unlike Facebook
• User base is large: 140 million users (compared with 901 million Facebook users and 100 million Google+ users)
• Ubiquitous among Politicians in the US (as of May 2012)
– 375 House members (of 435)
– 92 Senators (of 100)
– 49 Governors (of 51)
CHALLENGES IN TWITTER ANALYSIS
• Sampling and Coverage Problem
– The volume of Twitter data can be large, and it can be costly to obtain and store
– Coverage issue
• Text Analysis Problem
• Inference Problem
DATA SIZE AND SAMPLING
Source: Twitter
• During the State of the Union address, there were 766,681 SOTU-related tweets in 95 minutes (8,070 tweets per minute). Stored, the data come to approximately 600 MB.
• Imagine saving all the tweets during an election year.
• However, one can sample and save a subset.
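The figures on this slide can be checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes 1 MB means 1,000,000 bytes, and the 10% sampling fraction is a hypothetical value chosen for the example, not one from the talk.

```python
# Figures quoted on the slide.
SOTU_TWEETS = 766_681
SOTU_MINUTES = 95
SOTU_SIZE_MB = 600  # approximate, per the slide


def tweets_per_minute(tweets: int, minutes: int) -> float:
    """Average tweet rate over the observation window."""
    return tweets / minutes


def bytes_per_tweet(size_mb: float, tweets: int) -> float:
    """Average stored size of one tweet (assumption: 1 MB = 1,000,000 bytes)."""
    return size_mb * 1_000_000 / tweets


def sampled_size_mb(size_mb: float, sampling_fraction: float) -> float:
    """Expected storage if each tweet is kept with the given probability."""
    return size_mb * sampling_fraction


rate = tweets_per_minute(SOTU_TWEETS, SOTU_MINUTES)
per_tweet = bytes_per_tweet(SOTU_SIZE_MB, SOTU_TWEETS)
sample_mb = sampled_size_mb(SOTU_SIZE_MB, 0.10)  # hypothetical 10% sample

print(f"{rate:,.0f} tweets per minute")
print(f"{per_tweet:,.0f} bytes per stored tweet")
print(f"{sample_mb:.0f} MB for a 10% sample")
```

Storage scales linearly with the sampling fraction, which is why saving a subset is the practical option over an election year.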
COVERAGE – OBAMA TWEETS EXAMPLE
• Ideally, we want to assign a non-zero chance of selection to all the tweets that discuss a particular topic.
• However, with 340 million tweets a day, it is extremely inefficient to pull a random sample.
– Ex. About 20,000 tweets a day include “Barack Obama”, which is 0.0059% of all tweets.
• Another possibility is to query related words such as “the president”.
– However, this will increase noise.
[Figure: Venn diagram of the universe of tweets, with sets labeled “Barack”, “Obama”, and “The President”; callouts mark missed tweets (φ) and irrelevant tweets.]
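The inefficiency of simple random sampling follows directly from the two daily counts quoted above. The sketch below works through that arithmetic; the sample size of 1,000,000 and the target of 1,000 on-topic tweets are hypothetical values picked for illustration.

```python
DAILY_TWEETS = 340_000_000       # all tweets per day, from the slide
DAILY_KEYWORD_TWEETS = 20_000    # tweets containing "Barack Obama", from the slide


def keyword_share(keyword_tweets: int, all_tweets: int) -> float:
    """Fraction of all tweets that contain the keyword."""
    return keyword_tweets / all_tweets


def expected_hits(sample_size: int, share: float) -> float:
    """Expected number of on-topic tweets in a simple random sample."""
    return sample_size * share


def sample_size_needed(target_hits: int, keyword_tweets: int, all_tweets: int) -> int:
    """Random sample size whose expected on-topic count reaches target_hits.

    Uses integer ceiling division to avoid floating-point rounding.
    """
    return (target_hits * all_tweets + keyword_tweets - 1) // keyword_tweets


share = keyword_share(DAILY_KEYWORD_TWEETS, DAILY_TWEETS)
hits = expected_hits(1_000_000, share)  # hypothetical sample of one million tweets
needed = sample_size_needed(1_000, DAILY_KEYWORD_TWEETS, DAILY_TWEETS)

print(f"keyword share: {share * 100:.4f}%")
print(f"expected on-topic tweets in a 1,000,000-tweet sample: {hits:.1f}")
print(f"random sample needed for 1,000 on-topic tweets: {needed:,}")
```

Under these numbers, a random sample must contain millions of tweets to yield a modest number of on-topic ones, which motivates keyword queries despite their coverage gaps.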
POTENTIAL SOLUTION FOR COVERAGE ERRORS
• Stratified approach: create a list of users from a keyword query, then pull tweets by targeting those user IDs.

Stratum            Users in stratum                                           Cost
High Density       Users who tweeted “Obama” within 3 days                    Least Expensive
Mid Density        Users who tweeted “Obama” within 1 week                    Inexpensive
Low Density        Users who tweeted “Obama” within 1 month                   Expensive
Very Low Density   Users who have not tweeted “Obama” in more than 1 month    Cost Prohibitive
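One way to picture the stratification is as a function that bins each user by how recently they last tweeted the keyword. This is a minimal sketch, not the author's implementation: the user IDs are hypothetical, "1 month" is approximated as 30 days, and users with no keyword tweet on record are placed in the very low density stratum.

```python
from datetime import datetime, timedelta

# Thresholds follow the slide; "1 month" is approximated as 30 days (assumption).
STRATA = [
    ("high", timedelta(days=3)),
    ("mid", timedelta(days=7)),
    ("low", timedelta(days=30)),
]


def assign_stratum(last_keyword_tweet, now):
    """Return the density stratum for one user.

    last_keyword_tweet is the time of the user's most recent tweet
    containing the keyword, or None if no such tweet is known.
    """
    if last_keyword_tweet is None:
        return "very_low"
    age = now - last_keyword_tweet
    for name, limit in STRATA:
        if age <= limit:
            return name
    return "very_low"


def stratify(users, now):
    """Group user IDs by stratum. users maps user ID -> last keyword tweet time."""
    strata = {"high": [], "mid": [], "low": [], "very_low": []}
    for user_id, last_seen in users.items():
        strata[assign_stratum(last_seen, now)].append(user_id)
    return strata


now = datetime(2012, 5, 31)
users = {  # hypothetical user IDs and dates
    "user_a": now - timedelta(days=1),
    "user_b": now - timedelta(days=5),
    "user_c": now - timedelta(days=20),
    "user_d": now - timedelta(days=90),
    "user_e": None,
}
strata = stratify(users, now)
print(strata)
```

Each stratum can then be sampled at its own rate, with higher rates where on-topic tweets are dense and collection is cheap.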
TEXT ANALYSIS
• Many of our tools assume numeric data, and it is very difficult to map text sentences into numeric values.
• Several vendors offer rule-based sentiment scores.
– Ex. Like, love, greedy, enthusiastic
– Cannot handle sarcasm.
– Vendors use secret proprietary algorithms to code
• Alternative: supervised learning methods
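To make the sarcasm limitation concrete, here is a toy rule-based scorer. It is a sketch only: the lexicon reuses the four example words from the slide, the +1/-1 weights are assumptions, the example tweets are invented, and no vendor's actual algorithm is represented.

```python
import re

# Toy lexicon built from the example words on the slide; weights are assumptions.
LEXICON = {"like": 1, "love": 1, "enthusiastic": 1, "greedy": -1}


def rule_based_score(text: str) -> int:
    """Sum the lexicon weights of the words in the text (0 for unknown words)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(token, 0) for token in tokens)


# Invented example tweets.
sincere = "I love this candidate"
sarcastic = "Oh great, another tax cut for the rich. I just love it."
negative = "greedy bankers again"

for tweet in (sincere, sarcastic, negative):
    print(rule_based_score(tweet), tweet)
```

The sincere and the sarcastic tweet receive the same positive score because the rule sees only the word "love", not the intent behind it.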
HOW SENTIMENT CODING WORKS
[Diagram: Tweets → Human coder classifies and assigns sentiment (training dataset) → Create supervised learning models → Tweets]
Example: RTextTools. Timothy P. Jurka, Loren Collingwood, Amber E. Boydstun, Emiliano Grossman and Wouter van Atteveldt (2012). RTextTools: Automatic Text Classification via Supervised Learning. R package version 1.3.6. http://CRAN.R-project.org/package=RTextTools
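The workflow above (hand-coded tweets train a model, which then codes new tweets) can be sketched in a few lines. This example does not use RTextTools; it swaps in a small multinomial Naive Bayes classifier written in plain Python, and the training tweets and labels are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokenize(text):
                if token in self.vocab:  # skip words never seen in training
                    score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training set standing in for tweets coded by a human.
train = [
    ("love the new jobs plan", "positive"),
    ("great speech tonight", "positive"),
    ("enthusiastic crowd at the rally", "positive"),
    ("greedy policy hurts families", "negative"),
    ("terrible speech tonight", "negative"),
    ("this plan hurts workers", "negative"),
]
texts, labels = zip(*train)
model = NaiveBayes().fit(texts, labels)

print(model.predict("great plan love it"))
print(model.predict("terrible greedy policy"))
```

A real application would need far more coded tweets and a held-out set to measure how often the model agrees with the human coder.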
OTHER TEXT ANALYSIS EXAMPLES
• Forget about quantifying; use a strictly visual approach.
– Appearances of nouns and adjectives.
– Process tweets with natural language processing software
• Network analysis
– Identify “influentials” in the network
• Data: Twitter data that includes the following
1. Mitt Romney
2. WI recall election
FREQUENT TERMS THAT DESCRIBE ROMNEY
[Figure: frequent terms that describe Romney. Callouts: increased mentions of Romney’s money; “War on Women”; Romney’s proposed tax rate.]
Data: Sample of tweets that include “Mitt Romney”.
Processed with natural language analysis package using Python.
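A simplified version of the term counting behind such a figure can be sketched as follows. The slide's pipeline used a natural language analysis package to pick out nouns and adjectives; this sketch skips part-of-speech tagging and uses a small hand-written stopword list instead. The stopword list and the example tweets are assumptions made for illustration.

```python
import re
from collections import Counter

# Hand-written stopword list (assumption); the query terms themselves are
# excluded so they do not dominate the counts.
STOPWORDS = {
    "a", "an", "the", "and", "or", "of", "to", "in", "on", "is", "for",
    "his", "about", "with", "mitt", "romney",
}


def frequent_terms(tweets, top_n=5):
    """Return the top_n most common non-stopword terms across the tweets."""
    counts = Counter()
    for tweet in tweets:
        tokens = re.findall(r"[a-z]+", tweet.lower())
        counts.update(t for t in tokens if len(t) > 2 and t not in STOPWORDS)
    return counts.most_common(top_n)


# Invented example tweets.
sample_tweets = [
    "Mitt Romney tax rate is lower than mine",
    "Mitt Romney and his money in offshore accounts",
    "Romney money, Romney tax returns",
    "Mitt Romney tax plan",
]
top_terms = frequent_terms(sample_tweets, top_n=2)
print(top_terms)
```

Running the same count on successive time windows would show which terms rise or fall in the conversation.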
GOP PRIMARY POLLS AND TWITTER
1st row: GOP primary public polling summary.
2nd row: frequency of candidate names from Twitter sample.
VISUALIZING MENTIONS: WI RECALL ELECTION
[Figure: network of mentions, with clusters labeled “Liberal News Media” and “Tea Party Types” and a node labeled “Rasmussen poll”.]
Method: Collect a sample of tweets that include “WI re-election” or “Scott Walker”, then visualize the relationships among mentions.
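The data behind a mention network is a list of weighted edges from the author of a tweet to each account the tweet mentions. The sketch below builds those edges and ranks accounts by how often they are mentioned, as a rough stand-in for identifying "influentials". The account names and tweets are hypothetical, and drawing the graph itself is left to a visualization tool.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")


def mention_edges(tweets):
    """Count (author, mentioned account) pairs.

    tweets is an iterable of (author, text) tuples. Self-mentions are ignored
    and account names are lowercased.
    """
    edges = Counter()
    for author, text in tweets:
        for target in MENTION.findall(text):
            if target.lower() != author.lower():
                edges[(author.lower(), target.lower())] += 1
    return edges


def most_mentioned(edges, top_n=3):
    """Rank accounts by weighted in-degree (total times mentioned)."""
    in_degree = Counter()
    for (_, target), weight in edges.items():
        in_degree[target] += weight
    return in_degree.most_common(top_n)


# Hypothetical accounts and tweets.
tweets = [
    ("user_a", "Scott Walker leads, says @poll_x"),
    ("user_b", "@poll_x numbers on the WI recall"),
    ("user_c", "RT @news_y: Scott Walker rally"),
    ("user_a", "@news_y @poll_x again"),
]
edges = mention_edges(tweets)
ranking = most_mentioned(edges, top_n=2)
print(ranking)
```

Feeding the edge list to a graph layout tool would place accounts that mention the same sources near each other, which is how the ideological clusters become visible.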
CONSERVATIVE TWITTER ACCOUNTS
[Figure: network of conservative Twitter accounts; one node is labeled “Brother of Rush Limbaugh”.]
Network visualization allows us to see popular news sources and how mentions are clustered ideologically.
PROBLEM OF INFERENCE I
• We may develop better means of predicting sentiment and sampling, thus the measurement of Twitter opinion will improve as we gain experience.
• However, the distribution of opinions on Twitter is not directly transferable to the opinions of the general public or likely voters.
• Can we find a way to infer the opinions of the general population from Twitter data? Maybe.
PROBLEM OF INFERENCE II
• The purpose of research is not necessarily obtaining unbiased point estimates of a population parameter.
No smoke where there is no fire.
– Ex. Suppose one person claims “All swans are white.”
– I just need one black swan to prove that is not the case.
PROBLEM OF INFERENCE III : MODEL BASED
• If we can approximate the mechanism that separates Twitter users and the general public, and estimate opinion with the correct specification, the bias will be smaller.
– Ex. Heckman model, sample matching (YouGov Polimetrix surveys)
– Ex. Twitter sentiment of a college-educated gay black man who lives in Ohio
– He likely supports Obama and does not favor Romney.
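The idea of correcting for who is on Twitter can be illustrated with post-stratification, a simpler relative of the approaches named above; it is neither a Heckman model nor sample matching. The sketch reweights opinion within demographic cells to the cells' shares of the target population. All numbers and cell names below are made up for illustration, and the adjustment reduces bias only if Twitter users and non-users in the same cell hold similar opinions.

```python
def weighted_estimate(cell_support, cell_share):
    """Average cell-level support, weighting each cell by its share.

    cell_support maps cell -> estimated proportion supporting a candidate.
    cell_share maps cell -> share of the sample or population in that cell.
    """
    missing = set(cell_share) - set(cell_support)
    if missing:
        raise ValueError(f"no support estimate for cells: {sorted(missing)}")
    total = sum(cell_share.values())
    return sum(cell_support[c] * cell_share[c] for c in cell_share) / total


# Made-up illustrative numbers, not results from the talk.
twitter_support = {"college": 0.60, "non_college": 0.40}
twitter_share = {"college": 0.80, "non_college": 0.20}      # who is on Twitter
population_share = {"college": 0.30, "non_college": 0.70}   # target population

raw = weighted_estimate(twitter_support, twitter_share)
adjusted = weighted_estimate(twitter_support, population_share)

print(f"raw Twitter estimate: {raw:.2f}")
print(f"post-stratified estimate: {adjusted:.2f}")
```

The gap between the two estimates comes entirely from the difference between the Twitter sample's composition and the population's, which is the selection mechanism the slide refers to.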
SUMMARY
• Technical challenges that can be solved
– Sampling and coverage issue in Twitter sampling
– Mapping of text data into a scale
• Problems that are hard to solve in the near future
– Generalization of opinion distribution to general public
• Think differently
– Use Twitter to find emerging or rare patterns
– Use Twitter to see how people are obtaining information
– Different types of inference (find smoke)