big data analytics: discovering latent structure in twitter; a case study in the tragic aftermath of...

Richard Heimann © 2013 Big Social Data Analysis: Using location & Twitter to explore the tragic aftermath of the Sandy Hook Elementary School shooting Richard Heimann Chief Data Scientist at L-3 Data Tactics Adjunct Professor at UMBC Keegan Hines Data Scientist at L-3 Data Tactics

Upload: richard-heimann

Post on 12-Jul-2015


TRANSCRIPT

Page 1: Big Data Analytics: Discovering Latent Structure in Twitter; A Case Study in the Tragic Aftermath of the Sandy Hook School Shooting

Richard Heimann © 2013

Big Social Data Analysis: Using location & Twitter to explore the tragic aftermath of the Sandy Hook Elementary School shooting

Richard Heimann, Chief Data Scientist at L-3 Data Tactics; Adjunct Professor at UMBC

Keegan Hines, Data Scientist at L-3 Data Tactics

Page 2

Counting, counting, counting: Why do we count? How do we count?

What is measurement? What are latent constructs?

Traditional Data, Nontraditional Data | Sample vs. Population | Model Organism | Good Data, Bad Data

Case Study: Analyzing the discussion following the tragic events in Newtown, CT.

Big Social Data Analysis: A Case Study in Newtown

Page 3

Counting, counting, counting…

Page 4

How do we Count?

http://datatactics.blogspot.com/2013/07/analytics-in-perspective-inquiry-into.html

Analytics in Perspective: An Inquiry into Modes of Inquiry

Notice all these categories are counting. We count everything all the time.

Page 5

Counting and counting…

A: 75% of Americans favor some level of gun control.

A{spatial}: Americans in the northeast favor aggressive gun control by 3:1 over the south and midwest.

B: Most Americans favor some level of gun control.

B{spatial}: Americans in the northeast favor aggressive gun control at more than double the rate of Americans in the south and midwest.

C: Americans favor gun control.

C{spatial}: Americans in the northeast favor aggressive gun control over the south and midwest.

D: All Americans favor gun control.

Page 6

Counting and counting…

D: All Americans favor gun control.

- Many

- Much

- Some

- Numerous

- A little, A lot

- Often

- Always

- Rarely

Page 7

Why Quantitative Analysis?

Why is quantitative data analysis so important?

“…the alternative to good statistics is not ‘no statistics,’ it’s bad statistics. People who argue against statistical reasoning often end up backing up their arguments with whatever numbers they have at their command, over- or under-adjusting in their eagerness to avoid anything systematic.”

Bill James

Page 8

E.g. Quasi-Geo-qualitative Analysis?

Here the different Tribes meet in Friendship and collect Stone for Pipes.

Yanktons a Band of Sioux - 1000 Souls

F. Ratzel, C. Wissler, & C. Sauer: Culture Area Research and Mapping (1850’s)

Maps Descriptive of London Poverty (1899)

“No. 34 is occupied by the widow of a boatman. He committed suicide and left her with eleven children. Some have died, and she has five here now, two of whom go to work, and three to school. She makes sailor jackets, but is nearly blind. Struggles hard for her children…”

Page 9

Measurement…

Page 10

What is measurement?

Goal: the goal of all measurement is to arrange items on a continuum (observed or unobserved).

Page 11

“You can see a lot by observing…”

Yogi Berra

Page 12

What is measurement - real world?

Q: How can we measure something that is unobserved, or for which there is no direct measure?

A: Use a statistical model to measure the relationships between the observable variables and the unobserved (or “latent”) quantity.

Page 13

What is measurement - real world?

Estimating unobservable quantities: e.g. “topic or theme”

Twitter          Word1  Word2  Word3  topic
@TheDude         1      1      0
@WalterSobchak   0      1      1
@TheBigLebowski  1      0      1
…                …      …      …
@Donny           1      1      0
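A minimal Python sketch (not from the slides) of how a 0/1 word table like the one on this slide could be built. The tweet texts and the three vocabulary words are hypothetical, chosen only so the rows match the pattern shown; the topic column stays empty because it is the unobserved quantity a model would have to estimate.

```python
import re

def incidence_matrix(tweets, vocabulary):
    """Return one 0/1 row per user: 1 if the word occurs in the tweet, else 0."""
    rows = {}
    for user, text in tweets.items():
        words = set(re.findall(r"[a-z0-9']+", text.lower()))
        rows[user] = [1 if term in words else 0 for term in vocabulary]
    return rows

# Hypothetical vocabulary (Word1, Word2, Word3) and hypothetical tweet texts.
vocabulary = ["rug", "bowling", "rules"]
tweets = {
    "@TheDude": "that rug would look great at the bowling alley",
    "@WalterSobchak": "this is bowling, there are rules",
    "@TheBigLebowski": "my rug, my rules",
}

matrix = incidence_matrix(tweets, vocabulary)
for user, row in matrix.items():
    print(user, row)
```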

Page 14

Traditional Data, Nontraditional Data…

Page 15

Traditional Approaches to SocSci Inquiry

For example, Gun Control:

Surveys - that is, ask people what their position is about gun control.

…but, who? how many? Your friends? Family? People in your neighborhood? This is expensive.

Polls - similar to above but you often offer multiple choice.

…but, how do you construct the questions? How many questions? Same issues as above. This is expensive.

Legislature - count votes and federal/state funding.

…but, what are we measuring? Lobbyist or American valence?

Gun Sales/Deaths: that is, count the number of gun sales and/or deaths.

…but, are these normalized values? Are existing gun control laws controlled for?

Page 16

Nontraditional Approaches to SocSci Inquiry

Text is not only big, but is growing at an increasing rate. Twitter was launched March 21, 2006 and it took 3 years, 2 months and 1 day to reach 1 billion tweets. Twitter users now send one billion tweets every 2.5 days.

People are highly opinionated.

It's inexpensive:

> library(twitteR)
> guncontrol <- searchTwitter("#guncontrol", n=n, cainfo="cacert.pem")

It's comprehensive:

Dec 2012: ~30 days, ~210M tweets, ~40,334,000 users; #guncontrol ~14,500 tweets, ~10,200 users.
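A small Python sketch (not from the slides) that works through the arithmetic behind this slide's figures. The end date is an assumption derived by adding 3 years, 2 months and 1 day to the launch date; the tweet counts are the approximate values quoted on the slide.

```python
from datetime import date

launch = date(2006, 3, 21)          # launch date stated on the slide
first_billion = date(2009, 5, 22)   # assumption: launch + 3 years, 2 months, 1 day

# Days needed for the first billion tweets versus 2.5 days per billion "now".
days_to_first_billion = (first_billion - launch).days
speedup = days_to_first_billion / 2.5

# Share of the December 2012 collection that used #guncontrol.
guncontrol_share = 14_500 / 210_000_000

print(days_to_first_billion, round(speedup), f"{guncontrol_share:.4%}")
```

The point of the comparison is the rate: the same volume that once took over three years now arrives in days, while a single hashtag remains a very small slice of the stream.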

Page 17

Population vs. Sample

Population: The entire group under study.

Populations (N): often so large that we cannot examine the entire group. Samples are selected to represent the population.

Sample (n): Samples help answer questions about the population.

Nontraditional data allows n -> N
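A minimal Python sketch (not from the slides) of what n -> N means in practice: as the sample grows toward the whole population, the sample estimate approaches the population value, and at n = N they coincide. The population here is simulated and entirely hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1 = tweet mentions gun control, 0 = it does not.
N = 10_000
population = [1 if random.random() < 0.35 else 0 for _ in range(N)]
true_share = sum(population) / N

def sample_share(n):
    """Estimate the population share from a simple random sample of size n."""
    sample = random.sample(population, n)
    return sum(sample) / n

# Absolute error of the estimate for growing sample sizes.
for n in (100, 1_000, N):
    print(n, round(abs(sample_share(n) - true_share), 4))
```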

Page 18

Twitter as a model organism:

Page 19

What is Good Data?

Is “garbage in, garbage out” a statement we ought to take seriously?

+ Data collected in targeted rigorous ways - aka AUTHORITATIVE!

E.g. Census data, Surveys, Polls.

- Data tends to be narrow in scope both geographically and temporally and infrequently measured, if ever again.

The linchpin is “when the data is available.” The vast majority of data relating to emerging questions in business, politics and social science simply does not exist.

What is Bad Data?

Social Media is millions of conversations happening continuously and concurrently with varying degrees of decay and magnitude — Lots of signal and lots of noise. (+) We can ask a variety of questions from it.

Good data, Bad data

Page 20

Good data, Bad data

The opposite of good data is not bad data, it is no data.

The point: Good data and Bad data do not exist; there is just Data and NO Data.

Page 21

Good data, Bad data & A model organism

Question: Is Twitter a model organism?

Reality: We live in an imperfect world producing imperfect data - our job is to work with it.

Page 22

Case Study

Page 23

Big Social Data Analysis: A Case Study in Newtown

General:

- #PrayForNewtown: Emotional hashtag.
- #NRA: vague, broad
- #CTshooting: vague, broad
- #guncontrol: vague, broad
- #newtown: Emotional hashtag.

ProControl:

- #gunfail: pro gun control
- #p2: Refers to Progressives 2.0, the resource for progressives on social media. Progressivism is a political philosophy that prioritizes diversity and empowerment through social activism.
- #gunsense: seems to be pro gun control
- #NowIsTheTime: seems to be pro gun control

AgainstControl:

- #gunrights: anti gun control
- #2ndAmendment: anti gun control
- #2a: 2ndAmendment; anti gun control
- #tcot: Top Conservatives on Twitter; seems to be against gun control
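A minimal Python sketch (not from the slides) of how the hashtag categories on this slide could be applied to a tweet. The grouping into General, ProControl and AgainstControl follows the slide's three column headings as read from the flattened transcript, so treat the assignment of individual tags as an assumption; the function name is hypothetical.

```python
import re

# Stance labels transcribed from the slide's three columns.
HASHTAG_STANCE = {
    "#prayfornewtown": "General",
    "#nra": "General",
    "#ctshooting": "General",
    "#guncontrol": "General",
    "#newtown": "General",
    "#gunfail": "ProControl",
    "#p2": "ProControl",
    "#gunsense": "ProControl",
    "#nowisthetime": "ProControl",
    "#gunrights": "AgainstControl",
    "#2ndamendment": "AgainstControl",
    "#2a": "AgainstControl",
    "#tcot": "AgainstControl",
}

def stances(tweet):
    """Return the distinct stance labels of all known hashtags in a tweet."""
    tags = re.findall(r"#\w+", tweet.lower())
    return sorted({HASHTAG_STANCE[t] for t in tags if t in HASHTAG_STANCE})

example = ("gun control is like trying to reduce drunk driving by making it "
           "harder for sober people to buy cars! #tcot #pjnet")
print(stances(example))
```

Unknown tags such as #pjnet are ignored rather than guessed at, and a tweet carrying tags from several columns returns several labels.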

Page 24

Big Social Data Analysis: A Case Study in Newtown

> colnames(fulldf)
 [1] "ORIG_FILE"     "TWEET_ID"      "TIMESTAMP"     "SCREEN_NAM"    "TRUE_NAME"
 [6] "GENDER"        "LOCATION"      "LONG"          "LAT"           "HASHTAGS"
[11] "GT_COUNT"      "PT_COUNT"      "AT_COUNT"      "LANG"          "TEXT"
[16] "NAME"          "trimmedTweets" "Topic"         "TimeStamp_Day" "IRT_score"

Page 25

Big Social Data Analysis: A Case Study in Newtown

Bag of Words - the order of words doesn’t matter; we’re simply interested in which words were used.

Forget it, Donny, you're out of your element!

donny, element, forget, it, of, out, you're, your

Life does not start and stop at your convenience

and, at, convenience, does, life, not, start, stop, your
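A minimal Python sketch (not from the slides) of the bag-of-words step on the two example sentences. The slide lists each word once, so this version keeps only the distinct words; many implementations keep word counts as well. The function name is hypothetical.

```python
import re

def bag_of_words(text):
    """Lowercase the text and return its distinct words in alphabetical order."""
    return sorted(set(re.findall(r"[a-z0-9']+", text.lower())))

print(bag_of_words("Forget it, Donny, you're out of your element!"))
print(bag_of_words("Life does not start and stop at your convenience"))
```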

Page 26

Big Social Data Analysis: A Case Study in Newtown

Stopwords - remove words which are so common as to be uninformative (e.g. pronouns, articles)

Forget it, Donny, you're out of your element!

donny, element, forget, out

Life does not start and stop at your convenience

convenience, does, life, start, stop
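A minimal Python sketch (not from the slides) of stopword removal on the same two sentences. The stopword list is hypothetical and deliberately tiny, chosen only to reproduce the slide's output; real stopword lists are much longer and would likely drop words such as "does" and "out" too.

```python
import re

# Hypothetical stopword list, chosen only to reproduce the slide's two examples.
STOPWORDS = {"and", "at", "it", "not", "of", "you're", "your"}

def content_words(text):
    """Return the distinct non-stopword words of a text, alphabetically."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return sorted(words - STOPWORDS)

print(content_words("Forget it, Donny, you're out of your element!"))
print(content_words("Life does not start and stop at your convenience"))
```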

Page 27

Big Social Data Analysis: A Case Study in Newtown

Stemming - the same verb with different conjugations and tenses should be represented in just one way

I run, he runs, we enjoy running.

run, run, enjoy, run

You mark that frame an 8, and you're entering a world of pain.

mark, frame, 8, enter, world, pain
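A minimal Python sketch (not from the slides) of stemming on the two example sentences. The suffix rules below are a crude, hypothetical stand-in for a real stemmer such as Porter's, and the stopword list covers only these sentences; the slides do not say which stemmer the authors used.

```python
import re

# Hypothetical stopword list covering only the two example sentences.
STOPWORDS = {"i", "he", "we", "you", "you're", "that", "an", "and", "a", "of"}

def naive_stem(word):
    """Strip a trailing 'ing' or 's'; a crude stand-in for a real stemmer."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # Undo consonant doubling, e.g. 'runn' -> 'run'.
        if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiouls":
            word = word[:-1]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    return word

def stems(text):
    """Tokenize, drop stopwords, and stem the remaining words in order."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [naive_stem(w) for w in words if w not in STOPWORDS]

print(stems("I run, he runs, we enjoy running."))
print(stems("You mark that frame an 8, and you're entering a world of pain."))
```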

Page 28

Smokey, this is not 'Nam. This is bowling. There are rules.

How come you don't roll on Saturday, Walter?

I don't roll on Shabbos!

Well, sir, it's this rug I had. It really tied the room together.

I need to see you. I'm the one who took your rug.

Walter, he peed on my rug!

Topic Model - some tweets discuss similar things; they ought to be grouped together

Big Social Data Analysis: A Case Study in Newtown

Page 29

Big Social Data Analysis: A Case Study in Newtown

Idea: Posit a number of latent “topics,” then estimate the relationship between words-in-topics, and topics-in-tweets.

[Figure: Words/Terms (smokey, 'nam, bowling, rules, roll, saturday, shabbos, walter, how, come, rug, took, need, peed) map to Topics (Bowling, Rug), which map to Documents, aka Tweets (Tweet 1, Tweet 2, … Tweet n).]
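A minimal Python sketch (not from the slides) of estimating words-in-topics and topics-in-tweets. The slides do not name the algorithm; this assumes an LDA-style model fitted by collapsed Gibbs sampling, with hypothetical hyperparameters (alpha, beta) and hand-tokenized versions of the six example lines. On so little text the recovered grouping will not necessarily match the Bowling and Rug topics in the figure.

```python
import random

def fit_topics(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for a small LDA-style topic model."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]                        # topic counts per document
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]   # word counts per topic
    topic_total = [0] * n_topics                                      # tokens per topic
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment, then resample it.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Topics-in-tweets: smoothed share of each topic within each document.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return theta, topic_word

# Hand-tokenized example lines (hypothetical preprocessing).
docs = [
    ["smokey", "nam", "bowling", "rules"],
    ["roll", "saturday", "walter"],
    ["roll", "shabbos"],
    ["rug", "tied", "room"],
    ["need", "see", "took", "rug"],
    ["walter", "peed", "rug"],
]
theta, topic_word = fit_topics(docs, n_topics=2)
for doc, shares in zip(docs, theta):
    print(doc, [round(s, 2) for s in shares])
```

Each row of theta is one tweet's mix of topics, and topic_word holds how often each word was assigned to each topic; those two tables are the "topics-in-tweets" and "words-in-topics" relationships named in the idea above.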

Page 30

Big Social Data Analysis: A Case Study in Newtown

With the Twitter data, we’ll fit a topic model with 3 topics.

Topic # 1

This topic seems to capture discussions of gun control and gun rights as this political issue emerged in the conversation.

Top Words: #tcot, #tlot, #nra, #2ndamendment, #p2, gun, #gun, control, #newtown, #guncontrol, #2a, right, america, armed, obama, assault

Example Tweets:

wow, this shooting shit needs to stop. #guncontrol now.

oh good. #obamao has put #biden in charge of #guncontrol. that makes me feel all better about my rights and liberties.

gun control is like trying to reduce drunk driving by making it harder for sober people to buy cars! #tcot #pjnet

Page 31

Big Social Data Analysis: A Case Study in Newtown

With the Twitter data, we’ll fit a topic model with 3 topics.

Topic # 2

This topic seems to capture general chatter, commonly used words, and spam tweets.

Top Words: senate, #tcot, #tlot, video, obama, house, tell, like, free, fiscal, tax, news, need, via

Example Tweets:

help @newhampshirecr reach 600 followers! only 7 more to go! #nhcr #nhpolitics #tcot

rt @msegieda: @foxnews writes about @fracknation's premiere tonight on @axstv 9 pm et http://t.co/paafwgux #fracking #tcot #tlot #tp

michigan man, dog rescued after ice breaks: http://t.co/cc983uyv #tcot

Page 32

Big Social Data Analysis: A Case Study in Newtown

With the Twitter data, we’ll fit a topic model with 3 topics.

Topic # 3

This topic seems to capture descriptions of the tragedy as well as expressions of sympathy and sadness.

Top Words: #prayfornewtown, #newtown, #ctshooting, children, school, families, prayers, thoughts, victims, sad, tragedy, kids, god, little, today, lanza, rip

Example Tweets:

my heart goes out to everyone affected by the shooting at #sandycook elementary. a senseless tragedy. i can't imagine your pain. #newtown

dear god, please protect our babies from the monsters that live among us. my heart is breaking for those in #newtown

thoughts are with the students of #sandyhook #newtown sad situation

Page 33

Big Social Data Analysis: A Case Study in Newtown

• With a topic model, we can extract topics that make intuitive sense.

• But what about the usage patterns of these topics?

• Are there interesting temporal, ideological, or geographical trends/patterns?

Page 34

Big Social Data Analysis: A Case Study in Newtown

[Figure: tweet count (y-axis) by time/date (x-axis) for the GunControl, Mixed, and Sympathy topics, annotated with the events below.]

12/14: Newtown Shootings

12/16: Obama openly wept as he addressed the nation in the hours after the attack -- and stated that now was the time for "meaningful action" on gun violence.

12/19: “In the coming weeks, I will use whatever power this office holds to engage my fellow citizens, from law enforcement to mental health professionals to parents and educators, in an effort aimed at preventing more tragedies like this," Obama said. "Because what choice do we have? We can't accept events like this as routine."

1/19: Obama Presents Gun Control Agenda; Includes 23 Executive Orders
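A minimal Python sketch (not from the slides) of the counting behind a count-by-date chart per topic. It assumes rows keyed by the TimeStamp_Day and Topic column names listed for fulldf; the date format and the example rows are hypothetical.

```python
from collections import Counter

# Hypothetical rows using two of the column names listed for fulldf.
rows = [
    {"TimeStamp_Day": "2012-12-14", "Topic": "Sympathy"},
    {"TimeStamp_Day": "2012-12-14", "Topic": "Sympathy"},
    {"TimeStamp_Day": "2012-12-14", "Topic": "Mixed"},
    {"TimeStamp_Day": "2012-12-19", "Topic": "GunControl"},
    {"TimeStamp_Day": "2012-12-19", "Topic": "Sympathy"},
]

def daily_topic_counts(rows):
    """Count tweets for every (day, topic) pair."""
    return Counter((r["TimeStamp_Day"], r["Topic"]) for r in rows)

counts = daily_topic_counts(rows)
for (day, topic), n in sorted(counts.items()):
    print(day, topic, n)
```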

Page 35

Big Social Data Analysis: A Case Study in Newtown

Page 36

Big Social Data Analysis: A Case Study in Newtown

Everyone

Page 37

Big Social Data Analysis: A Case Study in Newtown

GunControl 2,733; 34.7%

Mixed 2,637; 33.5%

Sympathy 2,505; 31.8%

Everyone by Topic
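A small Python sketch (not from the slides) that recomputes the topic shares from the counts on this slide, as a check that the percentages follow from the counts.

```python
# Tweet counts per topic, as given on the slide.
counts = {"GunControl": 2733, "Mixed": 2637, "Sympathy": 2505}

total = sum(counts.values())
shares = {topic: round(100 * n / total, 1) for topic, n in counts.items()}

print(total)
for topic, pct in shares.items():
    print(topic, f"{pct}%")
```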

Page 38

Big Social Data Analysis: A Case Study in Newtown

Red: 2,179

Pink: 434

Purple: 887

Blue: 4,214

Light Blue: 161

Page 39

[Figure: Topics (GunControl, Mixed, Sympathy) by Geography (Blue, Red, Pink, Purple, Light Blue). Values shown: 10%, 9.2%, 8.5%; 18.3%, 17.2%, 18%.]

Page 40

Spatially Explicit Theory

Proximate casualties hypothesis (Gartner, Segura, and Wilkening 1997)

Time and space provide new insight on the multiple processes underlying opinion change in today’s complex information environment.

A case study of the “proximate casualties” hypothesis, the idea that popular support for American wars is undermined at the individual level more by the deaths of American personnel from nearby areas than by the deaths of those from far away.

http://jcr.sagepub.com/content/41/5/669.abstract

Page 41

Page 42

[Figure: Topics (GunControl, Mixed, Sympathy) by Geography (New England, Mountain, Middle Atlantic, East South Central, East North Central, West South Central, West North Central, Pacific, South Atlantic).]

Page 43

Summary

Counting, counting, counting: Why do we count? How do we count? Contrasting quantitative counting vs. qualitative counting.

What is measurement? What are latent constructs?

Traditional vs. Nontraditional approaches to SocSci Inquiry.

Population vs. Sample. Model Organisms. Good data, Bad data.

Case Study in Newtown using Twitter and Topic Modeling to explore temporal, ideological, and geographical elements.

Page 44

Thank you… Questions?

Richard Heimann: @rheimann https://twitter.com/rheimann [email protected]

Data Tactics Big Data Insights [blog]: http://datatactics.blogspot.com

Keegan Hines: @keeghin https://twitter.com/keeghin