
The Challenges in Analyzing Twitter Data for Public Opinion Researchers

Masahiko Aida, Director of Analytics

WHY TWITTER?

© 2012 All Rights Reserved.

• Access to Twitter data is open, unlike Facebook

• User base is large: 140 million users (compared with 901 million Facebook users and 100 million Google+ users)

• Ubiquitous among Politicians in the US (as of May 2012)

– 375 House members (of 435)

– 92 Senators (of 100)

– 49 Governors (of 51)

CHALLENGES IN TWITTER ANALYSIS

• Sampling and Coverage Problem

– The volume of Twitter data can be large, and it can be costly to obtain and store

– Coverage issue

• Text Analysis Problem

• Inference Problem


DATA SIZE AND SAMPLING


Source: Twitter

• During the State of the Union address, there were 766,681 SOTU-related tweets in 95 minutes (8,070 tweets per minute). The data amount to approximately 600 MB.

• Imagine saving all the tweets during an election year.

• However, one can sample and save a subset.
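The figures on this slide can be checked with a little arithmetic, and the idea of saving only a subset can be sketched as a simple Bernoulli sample over a stream. This is a minimal sketch: the 10% rate, the seed, the function name, and the reading of "600MB" as 600 million bytes are assumptions, not details from the talk.

```python
import random

SOTU_TWEETS = 766_681      # tweets reported on the slide
SOTU_MINUTES = 95          # duration reported on the slide
SOTU_BYTES = 600 * 10**6   # "approximately 600MB", read as 600 million bytes (assumption)

tweets_per_minute = SOTU_TWEETS / SOTU_MINUTES
bytes_per_tweet = SOTU_BYTES / SOTU_TWEETS


def bernoulli_sample(stream, rate, seed=0):
    """Keep each item independently with probability `rate`."""
    rng = random.Random(seed)
    return [item for item in stream if rng.random() < rate]


print(f"{tweets_per_minute:.0f} tweets per minute")
print(f"{bytes_per_tweet:.0f} bytes per tweet (rough average)")

# Hypothetical stream of tweet IDs standing in for real tweets.
kept = bernoulli_sample(range(SOTU_TWEETS), rate=0.10)
print(f"kept {len(kept)} of {SOTU_TWEETS} tweets")
```

Because every tweet has the same known chance of being kept, the saved subset can still be scaled back up to estimate totals for the full stream.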


COVERAGE – OBAMA TWEETS EXAMPLE


• Ideally, we want to assign a non‐zero chance of selection to all the tweets that discuss a particular topic.

• However, with 340 million tweets a day, it is extremely inefficient to pull a random sample.

– Ex. About 20,000 tweets a day include “Barack Obama”, which is 0.0059% of all tweets.

• Another possibility is to query related words such as “the president”.

– However, it will increase noise.
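The inefficiency of a purely random pull can be made concrete with the two figures quoted above. The sketch below is illustrative only: the target of 1,000 on-topic tweets and the function name are assumptions.

```python
TWEETS_PER_DAY = 340_000_000    # figure from the slide
OBAMA_TWEETS_PER_DAY = 20_000   # figure from the slide

share = OBAMA_TWEETS_PER_DAY / TWEETS_PER_DAY


def expected_draws(target_hits, hit_share):
    """Expected number of random tweets to pull to collect `target_hits` on-topic tweets."""
    return target_hits / hit_share


print(f"share of all tweets: {share:.4%}")
# Hypothetical target: 1,000 on-topic tweets.
print(f"random tweets needed: {expected_draws(1000, share):,.0f}")
```

At this share, collecting 1,000 on-topic tweets by simple random sampling means pulling about 17 million tweets on average, which is why a keyword query is the practical starting point despite its coverage gaps.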

[Figure: Venn diagram of the universe of tweets, showing tweets that match “Obama”, tweets that match “The President”, missed tweets (φ), and irrelevant tweets.]

POTENTIAL SOLUTION FOR COVERAGE ERRORS


• Stratified approach: create a list of users from a keyword query, then pull tweets by targeting those user IDs.

Stratum            Users included                                                  Cost
High density       Users who tweeted about Obama within 3 days                     Least expensive
Mid density        Users who tweeted about Obama within 1 week                     Inexpensive
Low density        Users who tweeted about Obama within 1 month                    Expensive
Very low density   Users who have not tweeted about Obama for more than 1 month    Cost prohibitive
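One way to picture the stratification is a small function that assigns each user to a stratum by how recently they tweeted about the topic. This is a sketch under stated assumptions: the user names and dates are hypothetical, and "1 month" is approximated as 30 days.

```python
from datetime import date


def assign_stratum(last_obama_tweet, today):
    """Return the density stratum for a user given the date of their last on-topic tweet.

    `last_obama_tweet` is None when no on-topic tweet is known for the user.
    """
    if last_obama_tweet is None:
        return "very low"
    days = (today - last_obama_tweet).days
    if days <= 3:
        return "high"
    if days <= 7:
        return "mid"
    if days <= 30:  # "1 month" approximated as 30 days (assumption)
        return "low"
    return "very low"


# Hypothetical users and dates, for illustration only.
today = date(2012, 5, 31)
last_seen = {
    "user_a": date(2012, 5, 30),
    "user_b": date(2012, 5, 26),
    "user_c": date(2012, 5, 10),
    "user_d": date(2012, 3, 1),
    "user_e": None,
}
strata = {user: assign_stratum(d, today) for user, d in last_seen.items()}
print(strata)
```

Tweets would then be pulled by user ID within each stratum, with the denser strata being the cheaper ones to collect, as the table indicates.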

TEXT ANALYSIS


• Many of our tools assume numeric data, and it is very difficult to map text into numeric values.

• Several vendors offer rule-based sentiment scores.

– Ex. like, love, greedy, enthusiastic

– Cannot handle sarcasm.

– Vendors use secret, proprietary algorithms for coding.

• Alternative: supervised learning methods
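The sarcasm problem with rule-based scoring is easy to reproduce with a toy word list. The sketch below is not any vendor's algorithm: the lexicon reuses the example words from the slide, and the +1/-1 weights and the sample sentences are assumptions.

```python
import re

# Tiny illustrative lexicon built from the example words on the slide;
# the +1/-1 weights are assumptions.
LEXICON = {"like": 1, "love": 1, "enthusiastic": 1, "greedy": -1}


def rule_based_score(text):
    """Sum the lexicon weights of the words in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)


print(rule_based_score("I love this candidate"))            # scored as positive
print(rule_based_score("Greedy politicians, all of them"))  # scored as negative
# Sarcasm: a human reads this as negative, but the rule scores it as positive.
print(rule_based_score("Oh great, I just love higher taxes"))
```

The third sentence gets the same score as the first because the rule only sees the word "love", which is the limitation the slide points to.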

HOW SENTIMENT CODING WORKS


[Flow diagram: Tweets → Human coder classifies and assigns sentiment (training dataset) → Create supervised learning models → Tweets]

Example: RTextTools. Timothy P. Jurka, Loren Collingwood, Amber E. Boydstun, Emiliano Grossman and Wouter van Atteveldt (2012). RTextTools: Automatic Text Classification via Supervised Learning. R package version 1.3.6. http://CRAN.R-project.org/package=RTextTools
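The workflow above (hand-coded tweets become a training set, and a model learned from them labels further tweets) can be sketched in a few lines. This is not RTextTools; it is a small multinomial naive Bayes classifier written from scratch in Python, and the training tweets and labels are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical hand-coded training tweets (the "human coder" step).
train_texts = [
    "I love the new jobs plan",
    "great speech tonight, very hopeful",
    "proud of this president",
    "terrible policy, I hate it",
    "this tax plan is awful and greedy",
    "worst debate performance ever",
]
train_labels = ["positive"] * 3 + ["negative"] * 3

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("what a great and hopeful plan"))
print(model.predict("awful, terrible speech"))
```

With only six training tweets the model is a toy; in practice the quality of the result depends mostly on the size and consistency of the hand-coded training set.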

OTHER TEXT ANALYSIS EXAMPLES


• Forget about quantifying: use a strictly visual approach.

– Appearances of nouns and adjectives.

– Process tweets with natural language processing software

• Network analysis

– Identify “influentials” in the network

• Data: Twitter data that includes the following:

1. Mitt Romney

2. WI recall election

FREQUENT TERMS THAT DESCRIBE ROMNEY


[Figure: frequent terms in tweets about Romney. Annotations: increased mentions of Romney’s money; War on Women; Romney’s proposed tax rate.]

Data: Sample of tweets that include “Mitt Romney”.

Processed with a natural language analysis package using Python.
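The term-frequency step behind a figure like this can be approximated with plain word counts. The sketch below is a simplified stand-in: the talk does not name the natural language package used, so a small stop-word list replaces part-of-speech tagging here, and the tweets are hypothetical.

```python
import re
from collections import Counter

# Small stand-in stop-word list (assumption). A fuller pipeline would use
# part-of-speech tagging to keep only nouns and adjectives. The query terms
# "mitt" and "romney" are dropped because every sampled tweet contains them.
STOP_WORDS = {
    "a", "an", "the", "is", "on", "of", "and", "to", "in", "his", "s",
    "about", "mitt", "romney",
}


def frequent_terms(tweets, top_n=3):
    """Count non-stop-word terms across tweets and return the most common."""
    counts = Counter()
    for tweet in tweets:
        words = re.findall(r"[a-z]+", tweet.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


# Hypothetical tweets, for illustration only.
tweets = [
    "Mitt Romney and his money",
    "Romney's tax rate is low",
    "Mitt Romney on the tax plan",
    "Romney money in the bank, more money",
]
print(frequent_terms(tweets, top_n=2))
```

Running the same count on samples from different dates is one way to spot shifts such as the increased mentions of money noted in the figure.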

GOP PRIMARY POLLS AND TWITTER


1st row: GOP primary public polling summary.

2nd row: frequency of candidate names from the Twitter sample.

VISUALIZING MENTIONS: WI RECALL ELECTION


[Figure: network of mentions around the WI recall election. Labeled clusters: liberal news media; Tea Party types. Labeled node: Rasmussen poll.]

Method: Collect a sample of tweets that include “WI re-election” or “Scott Walker”, then visualize the relationships among mentions.

CONSERVATIVE TWITTER ACCOUNTS


[Figure: network of conservative Twitter accounts. Annotation: brother of Rush Limbaugh.]

Network visualization allows us to see popular news sources and how mentions are clustered ideologically.
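The data behind such a network is a list of who mentions whom. The sketch below extracts @mentions and ranks accounts by how many distinct authors mention them, a simple in-degree proxy for "influentials". The account names and tweets are hypothetical, and the talk does not say which measure or tool was used.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")


def mention_edges(tweets):
    """Return (author, mentioned_account) pairs from (author, text) tweets."""
    edges = []
    for author, text in tweets:
        for mentioned in MENTION.findall(text):
            edges.append((author.lower(), mentioned.lower()))
    return edges


def most_mentioned(edges, top_n=3):
    """Rank accounts by how many distinct authors mention them (in-degree)."""
    distinct = set(edges)
    return Counter(target for _, target in distinct).most_common(top_n)


# Hypothetical accounts and tweets, for illustration only.
tweets = [
    ("alice", "New poll out from @pollster_x on the WI recall"),
    ("bob", "@pollster_x has Scott Walker ahead"),
    ("carol", "Reading @news_y on the recall"),
    ("alice", "RT @pollster_x: updated numbers"),
    ("dave", "@news_y and @pollster_x disagree"),
]
edges = mention_edges(tweets)
print(most_mentioned(edges))
```

The same edge list can be handed to a graph layout tool to draw the clusters; the ranking alone already shows which sources a community keeps pointing to.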

PROBLEM OF INFERENCE I


• We may develop better means of predicting sentiment and of sampling, so the measurement of Twitter opinion will improve as we gain experience.

• However, the distribution of opinions on Twitter is not directly transferable to the opinions of the general public or likely voters.

• Can we find a way to infer the opinions of the general population from Twitter data? Maybe.

PROBLEM OF INFERENCE II


• The purpose of research is not necessarily obtaining unbiased point estimates of a population parameter.

No smoke where there is no fire.

– Ex. Suppose one person claims “All swans are white.”

– I just need one black swan to prove that is not the case.

PROBLEM OF INFERENCE III : MODEL BASED


• If we can approximate the mechanism that separates Twitter users and the general public, and estimate opinion with the correct specification, the bias will be smaller.

– Ex. Heckman model, sample matching (YouGov Polimetrix surveys)

– Ex. Twitter sentiment of a college-educated gay black man who lives in Ohio

– He likely supports Obama and does not favor Romney.
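The idea behind sample matching can be illustrated with a toy version: for each person in a sample that represents the target population, borrow the opinion of the most similar Twitter user. This is a deliberately simplified sketch and not the procedure used by any survey firm: the attributes, the mismatch-count distance, matching with replacement, and all records are assumptions.

```python
def mismatch(a, b, keys):
    """Number of attributes on which two records differ."""
    return sum(a[k] != b[k] for k in keys)


def matched_estimate(target_sample, twitter_users, keys):
    """For each target-population record, borrow the opinion of the closest Twitter user."""
    borrowed = []
    for person in target_sample:
        match = min(twitter_users, key=lambda u: mismatch(person, u, keys))
        borrowed.append(match["supports"])
    return sum(borrowed) / len(borrowed)


KEYS = ("college", "state")

# Hypothetical Twitter users with a coded opinion (1 = supports the candidate).
twitter_users = [
    {"college": True, "state": "OH", "supports": 1},
    {"college": True, "state": "WI", "supports": 1},
    {"college": True, "state": "OH", "supports": 1},
    {"college": False, "state": "OH", "supports": 0},
]
# Hypothetical sample drawn to represent the target population.
target_sample = [
    {"college": True, "state": "OH"},
    {"college": False, "state": "OH"},
    {"college": False, "state": "OH"},
    {"college": True, "state": "WI"},
]

raw = sum(u["supports"] for u in twitter_users) / len(twitter_users)
matched = matched_estimate(target_sample, twitter_users, KEYS)
print(f"raw Twitter share: {raw:.2f}, matched estimate: {matched:.2f}")
```

In this toy data the Twitter users skew college educated, so the raw share overstates support relative to the matched estimate. The correction only helps to the extent that the matching attributes capture what separates Twitter users from the general public.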

SUMMARY


• Technical challenges that can be solved

– Sampling and coverage issues in Twitter sampling

– Mapping of text data onto a scale

• Problems that are hard to solve in the near future

– Generalization of the opinion distribution to the general public

• Think differently

– Use Twitter to find emerging or rare patterns

– Use Twitter to see how people are obtaining information

– Different types of inference (find smoke)
