big data analytics: discovering latent structure in twitter; a case study in the tragic aftermath of...
Richard Heimann © 2013
Big Social Data Analysis: Using location & Twitter to explore the tragic aftermath of the Sandy Hook Elementary School shooting
Richard Heimann, Chief Data Scientist at L-3 Data Tactics; Adjunct Professor at UMBC
Keegan Hines, Data Scientist at L-3 Data Tactics
Counting, counting, counting: Why do we count? How do we count?
What is measurement? What are latent constructs?
Traditional Data, Nontraditional Data | Sample vs. Population | Model Organism | Good Data, Bad Data
Case Study: Analyzing the discussion following the tragic events in Newtown, CT.
Big Social Data Analysis: A Case Study in Newtown
Counting, counting, counting…
How do we Count?
http://datatactics.blogspot.com/2013/07/analytics-in-perspective-inquiry-into.html
Analytics in Perspective: An Inquiry into Modes of Inquiry
Notice all these categories are counting. We count everything all the time.
Counting and counting…
A: 75% of Americans favor some level of gun control.
A{spatial}: Americans in the northeast favor aggressive gun control by 3:1 over the south and midwest.
B: Most Americans favor some level of gun control.
B{spatial}: Americans in the northeast favor aggressive gun control at more than double the rate of Americans in the south and midwest.
C: Americans favor gun control.
C{spatial}: Americans in the northeast favor aggressive gun control over the south and midwest.
D: All Americans favor gun control.
Counting and counting…
D: All Americans favor gun control.
- Many
- Much
- Some
- Numerous
- A little, A lot
- Often
- Always
- Rarely
Why Quantitative Analysis?
Why is quantitative data analysis so important?
“…the alternative to good statistics is not ‘no statistics,’ it’s bad statistics. People who argue against statistical reasoning often end up backing up their arguments with whatever numbers they have at their command, over- or under-adjusting in their eagerness to avoid anything systematic.”
Bill James
E.g. Quasi-Geo-qualitative Analysis?
Map annotations: “Here the different Tribes meet in Friendship and collect Stone for Pipes.” “Yanktons a Band of Sioux - 1000 Souls”
F. Ratzel, C. Wissler, & C. Sauer: Culture Area Research and Mapping (1850s)
Maps Descriptive of London Poverty (1899)
“No. 34 is occupied by the widow of a boatman. He committed suicide and left her with eleven children. Some have died, and she has five here now, two of whom go to work, and three to school. She makes sailor jackets, but is nearly blind. Struggles hard for her children…”
Measurement…
What is measurement?
Goal: /all/ measurement aims to arrange items on a continuum (observed or unobserved).
“You can see a lot by observing…”
Yogi Berra
What is measurement - real world?
Q: How can we measure something that is unobserved, or for which there is no direct measure?
A: Use a statistical model to measure the relationships between the observable variables and the unobserved (or “latent”) quantity.
Estimating unobservable quantities: E.g. “topic or theme”
Twitter           Word1  Word2  Word3  topic
@TheDude          1      1      0      ?
@WalterSobchak    0      1      1      ?
@TheBigLebowski   1      0      1      ?
…                 …      …      …      …
@Donny            1      1      0      ?
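A minimal sketch of how the observable part of such a table could be built: one row per account, one 0/1 column per word. It is written in Python rather than the R shown elsewhere in the deck, and the tweet texts and the three-word vocabulary are hypothetical stand-ins chosen only to reproduce the 0/1 pattern in the table.

```python
def indicator_rows(tweets, vocabulary):
    """Return {account: [0/1 per vocabulary word]} for each tweet."""
    rows = {}
    for account, text in tweets.items():
        words = set(text.lower().split())
        rows[account] = [1 if term in words else 0 for term in vocabulary]
    return rows


# Hypothetical tweet texts; only the account names come from the slide.
tweets = {
    "@TheDude": "rug bowling",
    "@WalterSobchak": "bowling rules",
    "@TheBigLebowski": "rug rules",
}
vocabulary = ["rug", "bowling", "rules"]  # stands in for Word1, Word2, Word3

rows = indicator_rows(tweets, vocabulary)
for account, indicators in rows.items():
    print(account, indicators)
```

The "topic" column is the part that cannot be filled in this way; it has to be estimated by a model.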
Traditional Data, Nontraditional Data…
Traditional Approaches to SocSci Inquiry
For example, Gun Control:
Surveys - that is, ask people what their position is about gun control.
…but, who? how many? Your friends? Family? People in your neighborhood? This is expensive.
Polls - similar to above but you often offer multiple choice.
…but, how do you construct the questions? How many questions? Same issues as above. This is expensive.
Legislature - count votes and federal/state funding.
…but, what are we measuring? Lobbyist or American valence?
Gun Sales/Deaths: that is, count the number of gun sales and/or deaths.
…but, are these normalized values? Are existing gun control laws controlled for?
Nontraditional Approaches to SocSci Inquiry
Text is not only big, but is growing at an increasing rate. Twitter was launched March 21, 2006 and it took 3 years, 2 months and 1 day to reach 1 billion tweets. Twitter users now send one billion tweets every 2.5 days.
People are highly opinionated.
It's inexpensive:
> library(twitteR)
> guncontrol <- searchTwitter("#guncontrol", n=n, cainfo="cacert.pem")
It's comprehensive:
Dec 2012: ~30 days, ~210M tweets, ~40,334,000 users; #guncontrol ~14,500 tweets, ~10,200 users.
Population vs. Sample
Population: The entire group under study.
Populations (N): often so large that we cannot examine the entire group. Samples are selected to represent the population.
Sample (n): Samples help answer questions about the population.
Nontraditional data allows n -> N
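A small Python sketch of the n -> N idea, under stated assumptions: the population below is hypothetical (10,000 people, 75% coded 1 for "favors"), and `sample_share` is an illustrative helper, not something from the deck. A sample share only estimates the population share; when the sample is the whole population, the two coincide.

```python
import random


def sample_share(population, n, rng):
    """Share of 1s in a simple random sample of size n (without replacement)."""
    sample = rng.sample(population, n)
    return sum(sample) / n


rng = random.Random(0)  # fixed seed so the sketch is repeatable
N = 10_000
population = [1] * 7_500 + [0] * 2_500  # hypothetical: 1 = favors, 0 = does not
true_share = sum(population) / N

# Estimates at increasing sample sizes, ending with n = N.
estimates = {n: sample_share(population, n, rng) for n in (10, 100, 1_000, N)}
for n, share in estimates.items():
    print(n, share, abs(share - true_share))
```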
Twitter as a model organism:
Good data, Bad data
What is Good Data?
Is “garbage in, garbage out” a statement we ought to take seriously?
+ Data collected in targeted, rigorous ways - aka AUTHORITATIVE.
E.g. Census data, Surveys, Polls.
- Data tends to be narrow in scope both geographically and temporally, and is infrequently measured, if ever again.
The linchpin is “when the data is available.” The vast majority of data on emerging questions in business, politics and social science simply does not exist.
What is Bad Data?
Social Media is millions of conversations happening continuously and concurrently with varying degrees of decay and magnitude: lots of signal and lots of noise. (+) We can ask a variety of questions from it.
The opposite of good data is not bad data, it is no data.
The point: Good data and Bad data do not exist; there is just Data and NO Data.
Good data, Bad data & A model organism
Question: Is Twitter a model organism?
Reality: We live in an imperfect world producing imperfect data - our job is to work with it.
Case Study
General:
#PrayForNewtown: Emotional hashtag.
#NRA: vague, broad
#CTshooting: vague, broad
#guncontrol: vague, broad
#newtown: Emotional hashtag.
ProControl:
#gunfail: pro gun control
#p2: Refers to Progressives 2.0, the resource for progressives on social media. Progressivism is a political philosophy that prioritizes diversity and empowerment through social activism.
#gunsense: seems to be pro gun control
#NowIsTheTime: seems to be pro gun control
AgainstControl:
#gunrights: anti gun control
#2ndAmendment: anti gun control
#2a: 2ndAmendment; anti gun control
#tcot: Top Conservatives on Twitter; seems to be against gun control
> colnames(fulldf)
 [1] "ORIG_FILE" "TWEET_ID" "TIMESTAMP" "SCREEN_NAM" "TRUE_NAME"
 [6] "GENDER" "LOCATION" "LONG" "LAT" "HASHTAGS"
[11] "GT_COUNT" "PT_COUNT" "AT_COUNT" "LANG" "TEXT"
[16] "NAME" "trimmedTweets" "Topic" "TimeStamp_Day" "IRT_score"
Bag of Words - the order of words doesn’t matter; we’re simply interested in which words were used.
Forget it, Donny, you're out of your element!
-> donny, element, forget, it, of, out, you're, your
Life does not start and stop at your convenience
-> and, at, convenience, does, life, not, start, stop, your
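The bag-of-words step can be sketched in a few lines of Python (a choice made for this illustration; the deck's own snippets are in R). The tokenizing rule, lowercase runs of letters, digits and apostrophes, is an assumption, and `bag_of_words` is a hypothetical helper name.

```python
import re


def bag_of_words(text):
    """Return the distinct lowercase words in text, ignoring order and punctuation."""
    return sorted(set(re.findall(r"[a-z0-9']+", text.lower())))


donny_bag = bag_of_words("Forget it, Donny, you're out of your element!")
life_bag = bag_of_words("Life does not start and stop at your convenience")
print(donny_bag)
print(life_bag)
```

Sorting is only for readability; the bag itself carries no word order.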
Stopwords - remove words which are so common as to be uninformative (e.g. pronouns, articles).
Forget it, Donny, you're out of your element!
-> donny, element, forget, out
Life does not start and stop at your convenience
-> convenience, does, life, start, stop
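A hedged Python sketch of stopword removal. The stopword list here is a hypothetical, hand-picked set that happens to reproduce the two examples on the slide; it is not the list the authors used, and real stopword lists are much longer.

```python
import re

# Hypothetical stopword list, chosen only to match the slide's two examples.
STOPWORDS = {"and", "at", "it", "not", "of", "you're", "your"}


def content_words(text, stopwords=STOPWORDS):
    """Distinct lowercase words in text with the stopwords removed."""
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    return sorted(tokens - stopwords)


donny_words = content_words("Forget it, Donny, you're out of your element!")
life_words = content_words("Life does not start and stop at your convenience")
print(donny_words)
print(life_words)
```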
Stemming - the same verb with different conjugations and tenses should be represented in just one way.
I run, he runs, we enjoy running.
-> run, run, enjoy, run
You mark that frame an 8, and you're entering a world of pain.
-> mark, frame, 8, enter, world, pain
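To make the idea concrete, here is a deliberately crude Python stemmer: it strips a trailing "ing" or "s" and undoes a doubled final consonant. This is a toy written for this illustration, not the Porter stemmer or whatever the authors used, and it will mis-stem many real words (for example, "rolling" becomes "rol"). The stopword list is again a hypothetical one sized to the two slide examples.

```python
import re

# Hypothetical stopword list covering just the two example sentences.
STOPWORDS = {"a", "an", "and", "he", "i", "of", "that", "we", "you", "you're"}


def crude_stem(word):
    """Toy suffix stripper: handles '-ing' and plural/third-person '-s' only."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) > 2 and word[-1] == word[-2]:
            word = word[:-1]  # "runn" -> "run"
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]  # "runs" -> "run"
    return word


def stemmed_terms(text):
    """Tokenize, drop stopwords, and stem, keeping the original word order."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


run_terms = stemmed_terms("I run, he runs, we enjoy running.")
pain_terms = stemmed_terms(
    "You mark that frame an 8, and you're entering a world of pain."
)
print(run_terms)
print(pain_terms)
```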
Topic Model - some tweets discuss similar things; they ought to be grouped together.
Smokey, this is not 'Nam. This is bowling. There are rules.
How come you don't roll on Saturday, Walter?
I don't roll on Shabbos!
Well, sir, it's this rug I had. It really tied the room together.
I need to see you. I'm the one who took your rug.
Walter, he peed on my rug!
Idea: Posit a number of latent “topics,” then estimate the relationship between words-in-topics, and topics-in-tweets.
[Figure: Documents, aka Tweets (Tweet 1, Tweet 2, …, Tweet n), linked to Topics (Bowling, Rug), which are linked to Words/Terms (smokey, 'nam, bowling, rules, how, come, roll, saturday, walter, shabbos, rug, need, took, peed).]
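One common way to estimate words-in-topics and topics-in-tweets is latent Dirichlet allocation fitted by collapsed Gibbs sampling. The deck does not say which model or library was used, so the Python sketch below is an assumption throughout: `fit_topics`, the `alpha` and `beta` smoothing values, and the hand-tokenized example documents are all illustrative.

```python
import random
from collections import Counter


def fit_topics(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for a small LDA-style topic model.

    Returns (theta, phi): theta[d][k] is the share of topic k in document d,
    and phi[k][w] is the probability of word w under topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]          # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]   # tokens per (topic, word)
    topic_total = [0] * n_topics                        # tokens per topic
    assign = []                                         # topic of every token

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * n_vocab)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    theta = [
        [(c + alpha) / (len(doc) + alpha * n_topics) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    phi = [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + beta * n_vocab)
         for w in vocab}
        for k in range(n_topics)
    ]
    return theta, phi


# Hand-tokenized stand-ins for the six example lines (stopwords already removed).
docs = [
    ["smokey", "nam", "bowling", "rules"],
    ["roll", "saturday", "walter"],
    ["roll", "shabbos"],
    ["rug", "tied", "room"],
    ["need", "took", "rug"],
    ["walter", "peed", "rug"],
]
theta, phi = fit_topics(docs, n_topics=2)
for k, word_probs in enumerate(phi):
    top = sorted(word_probs, key=word_probs.get, reverse=True)[:4]
    print("topic", k, top)
```

With only six short documents the recovered topics are unstable and depend on the seed; the point is the shape of the output, a topic mixture per tweet and a word distribution per topic.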
With the Twitter data, we’ll do a Topic model with 3 topics.
Topic # 1
This topic seems to capture discussions of gun control and gun rights as this political issue emerged in the conversation.
Top Words: #tcot, #tlot, #nra, #2ndamendment, #p2, gun, #gun, control, #newtown, #guncontrol, #2a, right, america, armed, obama, assault
Example Tweets:
wow, this shooting shit needs to stop. #guncontrol now.
oh good. #obamao has put #biden in charge of #guncontrol. that makes me feel all better about my rights and liberties.
gun control is like trying to reduce drunk driving by making it harder for sober people to buy cars! #tcot #pjnet
Topic # 2
This topic seems to capture general chatter, commonly used words, and spam tweets.
Top Words: senate, #tcot, #tlot, video, obama, house, tell, like, free, fiscal, tax, news, need, via
Example Tweets:
help @newhampshirecr reach 600 followers! only 7 more to go! #nhcr #nhpolitics #tcot
rt @msegieda: @foxnews writes about @fracknation's premiere tonight on @axstv 9 pm et http://t.co/paafwgux #fracking #tcot #tlot #tp
michigan man, dog rescued after ice breaks: http://t.co/cc983uyv #tcot
Topic # 3
This topic seems to capture descriptions of the tragedy as well as expressions of sympathy and sadness.
Top Words: #prayfornewtown, #newtown, #ctshooting, children, school, families, prayers, thoughts, victims, sad, tragedy, kids, god, little, today, lanza, rip
Example Tweets:
my heart goes out to everyone affected by the shooting at #sandycook elementary. a senseless tragedy. i can't imagine your pain. #newtown
dear god, please protect our babies from the monsters that live among us. my heart is breaking for those in #newtown
thoughts are with the students of #sandyhook #newtown sad situation
• With a topic model, we can extract topics that make intuitive sense.
• But what about the usage patterns of these topics?
• Are there interesting temporal, ideological, or geographical trends/patterns?
[Figure: tweet Count over Time/Date by topic (GunControl, Mixed, Sympathy), annotated with the events below.]
12/14: Newtown Shootings
12/16: Obama openly wept as he addressed the nation in the hours after the attack, and stated that now was the time for “meaningful action” on gun violence.
12/19: “In the coming weeks, I will use whatever power this office holds to engage my fellow citizens, from law enforcement to mental health professionals to parents and educators, in an effort aimed at preventing more tragedies like this,” Obama said. “Because what choice do we have? We can't accept events like this as routine.”
1/19: Obama Presents Gun Control Agenda; Includes 23 Executive Orders
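A count-over-time view like this one boils down to counting tweets per day and topic. The Python sketch below assumes each tweet is a record with the "TimeStamp_Day" and "Topic" fields listed in the data frame's column names; the rows themselves and the date format are hypothetical.

```python
from collections import Counter

# Hypothetical rows; only the field names come from the data frame's columns.
rows = [
    {"TimeStamp_Day": "2012-12-14", "Topic": "Sympathy"},
    {"TimeStamp_Day": "2012-12-14", "Topic": "Sympathy"},
    {"TimeStamp_Day": "2012-12-14", "Topic": "GunControl"},
    {"TimeStamp_Day": "2012-12-16", "Topic": "GunControl"},
    {"TimeStamp_Day": "2012-12-16", "Topic": "Mixed"},
]


def counts_by_day_and_topic(rows):
    """Count tweets for every (day, topic) pair."""
    return Counter((row["TimeStamp_Day"], row["Topic"]) for row in rows)


daily_counts = counts_by_day_and_topic(rows)
for (day, topic), count in sorted(daily_counts.items()):
    print(day, topic, count)
```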
Everyone
Everyone by Topic
GunControl: 2,733; 34.7%
Mixed: 2,637; 33.5%
Sympathy: 2,505; 31.8%
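The percentages follow directly from the counts, as this short Python check shows. The counts are the ones on the slide; the variable names are illustrative.

```python
# Tweet counts per topic, as reported on the slide.
topic_counts = {"GunControl": 2733, "Mixed": 2637, "Sympathy": 2505}

total = sum(topic_counts.values())
shares = {
    topic: round(100 * count / total, 1)
    for topic, count in topic_counts.items()
}
print(total)
print(shares)
```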
Red: 2,179
Pink: 434
Purple: 887
Blue: 4,214
Light Blue: 161
[Figure: Topics (GunControl, Mixed, Sympathy) by Geography, broken out by group (Blue, Red, Pink, Purple, Light Blue). Surviving value labels, in the order extracted: 10%, 9.2%, 8.5%; 18.3%, 17.2%, 18%.]
Spatially Explicit Theory
Proximate casualty hypothesis (Gartner, Segura, and Wilkening 1997)
Time and space provide new insight on the multiple processes underlying opinion change in today’s complex information environment.
A case study of the “proximate casualties” hypothesis: the idea that popular support for American wars is undermined at the individual level more by the deaths of American personnel from nearby areas than by the deaths of those from far away.
http://jcr.sagepub.com/content/41/5/669.abstract
[Figure: Topics (GunControl, Mixed, Sympathy) by Geography, for the regions New England, Mountain, Middle Atlantic, East South Central, East North Central, West South Central, West North Central, Pacific, and South Atlantic.]
Summary
Counting, counting, counting: Why do we count? How do we count? Contrasting quantitative counting vs. qualitative counting.
What is measurement? What are latent constructs?
Traditional vs. Nontraditional approaches to SocSci Inquiry.
Population vs. Sample. Model Organisms. Good data, Bad data.
Case Study in Newtown using Twitter, Topic Modeling to explore temporal, ideological, and geographical elements.
Richard Heimann: @rheimann https://twitter.com/rheimann [email protected]
Data Tactics Big Data Insights [blog]: http://datatactics.blogspot.com
Keegan Hines: @keeghin https://twitter.com/keeghin
Thank you… Questions?