big data and the social sciences

Big Data Technology and the Social Sciences: A Lecture at Mannheim University Abe Usher CCHP, CISSP Chief Technology Officer, HumanGeo

Upload: abe-usher

Post on 01-Dec-2014


DESCRIPTION

Big Data and the Social Sciences Ex-Google engineer Abe Usher presents a talk about Big Data technology and methods applicable to social science. Participants will learn techniques that are used by Google engineers to collect, clean, analyze, and visualize Big Data. Additionally Mr. Usher will provide URLs to sample data, open source applications, and code to those interested in applying these Big Data methods themselves.

TRANSCRIPT

Page 1: Big Data and the Social Sciences

Big Data Technology and the Social Sciences:

A Lecture at Mannheim University

Abe Usher CCHP, CISSP Chief Technology Officer, HumanGeo

Page 2: Big Data and the Social Sciences


What’s In It For You?

Theory

•Definitions and overview

•Where data are being generated

Practice

•Google’s three secret techniques* for unlocking insights from data

•The kitchen model

•Recommended resources to build data science skills

Presentation slides: http://www.slideshare.net/abeusher/big-data-and-the-social-sciences

*Not specifically endorsed by Google. Also, not really a secret.

Page 3: Big Data and the Social Sciences


Background

HumanGeo is focused on digital Human Geography:

Understanding the location attributes of individuals and groups

And the social attributes of locations

Through ‘Big Data’ analysis of billions of geolocated data elements

Page 4: Big Data and the Social Sciences


Big Data Wake-Up Call

Berkeley University Research http://goo.gl/zjSUr1

By 2016 the rate of data growth surpasses the rate of Moore’s Law

Page 6: Big Data and the Social Sciences


Big Data Definition

Boring Traditional definition

“High volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”

Page 7: Big Data and the Social Sciences


Big Data Definition

Abe’s definition:

Page 8: Big Data and the Social Sciences


The Original “Big Data”

1880 US Census

•50 million people

•Data included: age, gender, number of insane people in household*

•Took 7 years to tabulate

•1890 Census estimated at 13 years to complete

*Credit to Ken Krugler for this factoid: http://www.censusrecords.com/content/1880_census

Page 9: Big Data and the Social Sciences


The Original “Big Data”

1890

•63 million people

•Additional data: citizenship and military service

•New technology: Hollerith Tabulating System

•Took 6 weeks to tabulate (roughly 60x faster than the 7 years needed for 1880)

Takeaway

• Better technology and methodology led to a roughly 60x speedup


Page 10: Big Data and the Social Sciences


Data Generation

Where are data created?

•Website interaction logs

•Social Media

•Cyber events

•Smartphones

What is the volume?

•3B phone calls in USA

•700M Facebook posts

•500M tweets per day

•50B WhatsApp messages per day

Takeaway

• Social media, telecommunication, and instant messaging generate an increasingly high volume of data

Page 11: Big Data and the Social Sciences


Traditional Model of Interpreting Observations

Tracy Marrow (aka “Ice-T”)

How can you identify a legitimate hip-hop artist (versus someone who just gets up and rhymes)?

http://www.npr.org/2005/08/30/4824690/original-gangster-rapper-and-actor-ice-t

Page 12: Big Data and the Social Sciences


“Game knows game, baby.”


Page 13: Big Data and the Social Sciences


“If you have expert knowledge, then you are capable of answering complex questions by interpreting domain specific information.” [paraphrased]


Page 14: Big Data and the Social Sciences

Trust Models for complex data

• August Gorman carried out a plot to grab fractions of a penny from a corporate payroll system. http://goo.gl/vAScel

IMDB: 4.9/10

Rotten Tomatoes: 26/100

Page 15: Big Data and the Social Sciences

Trust Models for complex data

• Peter Gibbons hatches a plot to write a computer virus that grabs fractions of a penny from a corporate retirement account. http://goo.gl/rDg1U

• Known in security circles as a salami attack.

IMDB: 7.9/10

Rotten Tomatoes: 79/100

Takeaway point: Little bits of value (information) provide deep insights in the aggregate

Page 16: Big Data and the Social Sciences


1. Aggregation

2. Visualization

3. Correlation

New Models of Interpreting (Big) Data

Takeaways

• Expert-based knowledge is no longer sufficient.

• Simple mathematical methods create value from captured data.

Page 17: Big Data and the Social Sciences

Aggregation (Counting)

William Thomson, 1st Baron Kelvin

“When you can measure what you are speaking about, and express it in numbers, you know something about it.”

Takeaway

• Aggregation via counting things is the most common way to exploit Big Data
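The counting idea can be sketched in a few lines of Python. This is only an illustration: the `search_log` list and the `count_items` helper are hypothetical and do not come from the talk.

```python
from collections import Counter

# Hypothetical sample: one search term per log entry (not data from the talk).
search_log = [
    "flu symptoms", "big data", "flu symptoms",
    "diet coke", "big data", "flu symptoms",
]


def count_items(items):
    """Return a Counter mapping each distinct item to how often it occurs."""
    return Counter(items)


term_counts = count_items(search_log)
# The two most frequent terms, most frequent first.
top_terms = term_counts.most_common(2)
print(top_terms)
```

The same pattern scales from a list in memory to log files read line by line, because a `Counter` only needs to see one item at a time.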

Page 18: Big Data and the Social Sciences

The book “Fearless” is much more popular than the 80s movie “Navy Seals.” It also has a more favorable distribution of reviews.

Aggregation: A Tale of Two Products

Page 19: Big Data and the Social Sciences

The distribution we’re looking for looks like the #1 hand: responses concentrated in the most positive category, with very few responses that were unfavorable.

Aggregation: A Tale of Two Products
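A minimal sketch of comparing two review distributions, assuming star ratings from 1 to 5. The two rating lists are invented for illustration and are not the review data shown on the slides.

```python
from collections import Counter


def rating_shares(ratings):
    """Return {stars: share of all ratings} for stars 1 through 5."""
    if not ratings:
        raise ValueError("need at least one rating")
    counts = Counter(ratings)
    total = len(ratings)
    return {stars: counts.get(stars, 0) / total for stars in range(1, 6)}


# Hypothetical ratings for two products (assumed data).
product_a = [5] * 8 + [4] * 1 + [1] * 1   # concentrated in the top category
product_b = [5] * 3 + [3] * 3 + [1] * 4   # spread out, many unfavorable

shares_a = rating_shares(product_a)
shares_b = rating_shares(product_b)

# Product A matches the favorable shape: most mass at 5 stars, little at 1 star.
print(shares_a[5], shares_a[1])
print(shares_b[5], shares_b[1])
```

Comparing shares rather than raw counts keeps the comparison fair when one product has far more reviews than the other.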

Page 20: Big Data and the Social Sciences

Aggregation & Visualization: Counting with Google Trends

Page 21: Big Data and the Social Sciences

Aggregation & Visualization: Bing Search vs. Google Search

Page 22: Big Data and the Social Sciences

Aggregation: Diet Pepsi vs. Diet Coke

Page 23: Big Data and the Social Sciences

Aggregation & Visualization: Big Data vs. Britney Spears

Page 24: Big Data and the Social Sciences

Geospatial Visualization Example: Social Drift in DC

Takeaway

• Visualization provides a powerful mechanism for Exploratory Data Analysis

Page 25: Big Data and the Social Sciences

Correlation: Canadian Flu Research

Gunther Eysenbach

•Professor at the University of Toronto

•Focused on eHealth

•Google Ads user

Infodemiology

•2004-2005 tracked flu related searches

•54,507 Ad impressions in Canada

•High R^2 correlation to actual flu activity

http://gunther-eysenbach.blogspot.com/

Infodemiology paper: http://goo.gl/aeUZtA

Takeaway

• Human behavior in response to Google Ads related to the flu was highly correlated with “officially reported” cases of the flu.
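To make the R^2 claim concrete, here is a small sketch that computes the Pearson correlation coefficient and its square. The weekly `ad_clicks` and `flu_cases` series are made-up numbers, not the figures from the infodemiology study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical weekly series (assumed data, for illustration only).
ad_clicks = [10, 20, 30, 40, 50]
flu_cases = [12, 25, 31, 47, 55]

r = pearson_r(ad_clicks, flu_cases)
r_squared = r ** 2   # share of variance explained by a straight-line fit
print(round(r, 3), round(r_squared, 3))
```

A high R^2 here only says the two series move together; it does not say that one causes the other.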

Page 26: Big Data and the Social Sciences

Correlation: Google Flu Trends

“Google Flu Trends provides near real-time estimates of flu activity for a number of countries and regions around the world based on aggregated search queries.”

Process

•Map searches to regions

•Quantify “normal”

•Detect “anomalies”

NPR: http://goo.gl/Iv7A87

NYT: http://goo.gl/mNyAi7
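The three process steps above can be sketched as follows. This is not Google's actual method: the z-score rule, the threshold of 3.0, the region names, and all counts are assumptions chosen to keep the example short.

```python
from collections import Counter
from statistics import mean, stdev


def count_by_region_week(searches):
    """Step 1: map searches to regions. Input is (region, week) pairs."""
    return Counter(searches)


def baseline(history):
    """Step 2: quantify "normal" as mean and sample standard deviation."""
    return mean(history), stdev(history)


def is_anomaly(count, history, threshold=3.0):
    """Step 3: flag a count that sits far above the historical baseline."""
    mu, sigma = baseline(history)
    if sigma == 0:
        return count > mu
    return (count - mu) / sigma > threshold


# Hypothetical flu-related searches for week 6 (assumed data).
current = count_by_region_week(
    [("Ontario", 6)] * 160 + [("Quebec", 6)] * 108
)
# Hypothetical weekly counts for the five preceding weeks.
history = {
    "Ontario": [100, 110, 90, 105, 95],
    "Quebec": [100, 110, 90, 105, 95],
}

flagged = sorted(
    region
    for (region, week), count in current.items()
    if is_anomaly(count, history[region])
)
print(flagged)
```

A production system would also need to handle seasonality and changes in overall search volume, which this sketch ignores.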

Page 27: Big Data and the Social Sciences

Correlation: Box Office Hit Prediction

“Use of socially generated ‘big data’ to access information about collective states of the minds in human societies has become a new paradigm in the emerging field of computational social science.”

Simple factors

•number of total page views

•number of total edits made

•number of users editing

•number of revisions in the article's revision history

Early Prediction of Movie Box Office Success: http://goo.gl/BWf7H1

Counts of Wikipedia factors correlate to Box Office sales
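As a rough illustration of turning one such count into a prediction, the sketch below fits a straight line by ordinary least squares. It uses a single factor and invented numbers; it does not reproduce the model or the data from the cited paper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predicted y for a new x under the fitted line."""
    return slope * x + intercept


# Hypothetical data: article page views (thousands) before release,
# and opening box office revenue (millions). Assumed values.
page_views = [10, 20, 30, 40]
revenue = [5, 9, 13, 17]

slope, intercept = fit_line(page_views, revenue)
print(predict(50, slope, intercept))
```

With several factors (page views, edits, editors, revisions) the same idea extends to multiple regression, which is easier to do with a numerical library than by hand.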

Page 28: Big Data and the Social Sciences

Big Data: Significance for Social Sciences

1. Proxy variables. Digital exhaust collected for purposes other than surveys often creates ‘proxy variables’ that provide complementary insights.

2. Aggregation insights. Combining many small observations leads to insights that we can trust.

3. Data linking. It is possible to ‘link’ or synchronize records between digital exhaust and instrumented surveys by selecting a common dimension (e.g. location).

The future of social science will involve combining “fuzzy Big Data insights” with instrumented survey results
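Data linking on a common dimension can be sketched as a simple join. The record layouts, field names, and values below are hypothetical, and `link_by_key` is an illustrative helper rather than anything presented in the talk.

```python
def link_by_key(exhaust_records, survey_records, key="location"):
    """Inner-join two lists of dicts on a shared key (e.g. location)."""
    # Index survey records by the shared key for fast lookup.
    survey_index = {}
    for record in survey_records:
        survey_index.setdefault(record[key], []).append(record)

    linked = []
    for record in exhaust_records:
        for match in survey_index.get(record[key], []):
            # Merge the two records; exhaust fields win on name clashes.
            linked.append({**match, **record})
    return linked


# Hypothetical digital exhaust and survey results (assumed data).
exhaust = [
    {"location": "Mannheim", "tweets": 1200},
    {"location": "Berlin", "tweets": 5400},
]
survey = [
    {"location": "Mannheim", "respondents": 150},
    {"location": "Heidelberg", "respondents": 80},
]

linked = link_by_key(exhaust, survey)
print(linked)
```

Only locations present in both sources survive the join, which is a reminder to check how much of each dataset is lost when linking.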

Page 29: Big Data and the Social Sciences

Correlation Does Not Equal Causation

http://xkcd.com/552/

Page 30: Big Data and the Social Sciences

The kitchen model of value creation

•Chef: Your Staff

•Ingredients: Your Data

•Utensils: Technology

•Recipes: Techniques

Page 31: Big Data and the Social Sciences

Take Action: Experiment yourself

Exploratory Data Analysis lifecycle:

• collect - Twitter API, Datasift.com

• clean - OpenRefine

• analyze - Python or R

• visualize - Google Earth

Related data: https://s3.amazonaws.com/devbackup/germany.txt.gz

Related code: https://github.com/abeusher
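The clean, analyze, and visualize steps of the lifecycle can be sketched end to end in plain Python. The tab-separated layout (latitude, longitude, text) is an assumption made for this example and is not the documented format of the related data file; `RAW_LINES` stands in for collected data. The output is a KML string, a format that Google Earth can open.

```python
from collections import Counter

# Hypothetical collected records: latitude<TAB>longitude<TAB>text (assumed layout).
RAW_LINES = [
    "49.4875\t8.4660\tHello Mannheim",
    "49.4875\t8.4661\tCoffee time",
    "52.5200\t13.4050\tBerlin calling",
    "not-a-number\t8.0\tbroken row",
    "95.0\t8.0\tlatitude out of range",
]


def clean(lines):
    """Keep rows with exactly three fields and valid coordinates."""
    rows = []
    for line in lines:
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        try:
            lat, lon = float(parts[0]), float(parts[1])
        except ValueError:
            continue
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            rows.append((lat, lon, parts[2].strip()))
    return rows


def analyze(rows, precision=1):
    """Count posts per grid cell by rounding coordinates."""
    return Counter(
        (round(lat, precision), round(lon, precision)) for lat, lon, _ in rows
    )


def to_kml(cell_counts):
    """Build a KML document with one placemark per grid cell."""
    placemarks = []
    for (lat, lon), count in sorted(cell_counts.items()):
        # KML coordinates are written as longitude,latitude.
        placemarks.append(
            f"<Placemark><name>{count} posts</name>"
            f"<Point><coordinates>{lon},{lat}</coordinates></Point></Placemark>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>\n'
        + "\n".join(placemarks)
        + "\n</Document></kml>\n"
    )


rows = clean(RAW_LINES)
cells = analyze(rows)
kml_text = to_kml(cells)
print(kml_text)
```

Writing `kml_text` to a file with a `.kml` extension gives something that can be loaded for visual inspection; the collect step is left out because it depends on API credentials.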

Page 32: Big Data and the Social Sciences


Take Action: Explore

Google Trends: http://goo.gl/8eJZg

Google Ngram: http://goo.gl/4U09fa

Google Correlate: http://goo.gl/nEhe8D

Bing Keyword Research: http://goo.gl/q2V88g

Page 33: Big Data and the Social Sciences


Contact information

Abe Usher

Email: [email protected]

Twitter: @abeusher

LinkedIn: http://goo.gl/DUxZOP

Presentations: http://goo.gl/bCa3Qt