the social science data revolutiongking.harvard.edu/files/gking/files/evbase-horizonsp.pdfsocial...

24
The Social Science Data Revolution Gary King Department of Government Harvard University (Horizons in Political Science talk, Government Department, Harvard University, 3/30/11) Gary King (Harvard, Gov Dept) The Data Revolution 1 / 24

Upload: others

Post on 20-May-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

The Social Science Data Revolution

Gary King

Department of GovernmentHarvard University

(Horizons in Political Science talk, Government Department, Harvard University, 3/30/11)

Gary King (Harvard, Gov Dept) The Data Revolution 1 / 24

The Changing Evidence Base of Social Science Research

The Last 50 Years:

Survey research

Aggregate government statistics

In depth studies of individual places, people, or events

The Next 50 Years: Spectacular increases in new data sources, due to. . .

Much more of the above — improved, expanded, and applied

Shrinking computers & the growing Internet: data everywhere

The replication movement: academic data sharing (e.g., Dataverse)

Governments encouraging data collection, distribution,experimentation (e.g., GovData)

Advances in statistical methods, informatics, & software

The march of quantification: through academia, professions,government, & commerce (SuperCrunchers, The Numerati,MoneyBall)

Gary King (Harvard, Gov Dept) The Data Revolution 2 / 24

Examples of what’s now possible

Opinions of activists: A few thousand interviews millions ofpolitical opinions in social media posts (1B tweets/week)

Exercise: A survey: “How many times did you exercise last week? 500K people carrying cell phones with accelerometers

Social contacts: A survey: “Please tell me your 5 best friends” continuous record of phone calls, emails, text messages, bluetooth,social media connections, electronic address books

Economic development in developing countries: Dubious ornonexistent governmental statistics satellite images ofhuman-generated light at night, or networks of roads and otherinfrastructure

Expert-vs-Statistician contests: Whenever enough information isquantified (& a right answer exists), stats wins every time

Many, many more. . .

Gary King (Harvard, Gov Dept) The Data Revolution 3 / 24

How to make progress in the new data-rich world?

1 Computer-assisted methods: Traditional quantitative-only orqualitative-only approaches are infeasible

2 Large-scale, interdisciplinary, collaborative research

3 New statistical methods & engineering required

4 Better theory: to respond to massive new evidence, privacychallenges, data-driven science

Bigger changes in the practice of social science then ever before

Gary King (Harvard, Gov Dept) The Data Revolution 4 / 24

Two Examples

of Automated Text Analysis

Gary King (Harvard, Gov Dept) The Data Revolution 5 / 24

Example 1: How to Read a Billion Blog Posts(& Classify Deaths without Physicians)

Daniel Hopkins and Gary King. “Extracting Systematic Social ScienceMeaning from Text” AJPS, commercialized via:

Gary King and Ying Lu. “Verbal Autopsy Methods with MultipleCauses of Death,” Statistical Science used by (among others):

Gary King (Harvard, Gov Dept) The Data Revolution 6 / 24

Data and Quantities of Interest

Input Data:

Large set of text documents (e.g., all English language blog posts)Categories (posts about US candidates): extremely negative, negative,neutral, positive, extremely positive, no opinion, not a blogA small “training set” of documents hand-coded into the categories

Quantities of interest

Computer science: individual document classifications (spam filters,Google searches)Social Science: proportion in each category (proportion of email whichis spam; proportion extremely negative comment about Pres Bush)

Estimation

Can get the 2nd by counting the 1st (if 1st is accurate)High classification accuracy ; unbiased category proportions70% classification accuracy is high ⇒ disaster for category proportionsNew methodology: unbiased category proportions, even whenclassification accuracy is low

Gary King (Harvard, Gov Dept) The Data Revolution 7 / 24

Out-of-sample Comparison: 60 Seconds vs. 8.7 Days

●● ●●

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

Affect in Blogs

Actual P(D)

Est

imat

ed P

(D)

●●●

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

Affect in Blogs

Actual P(D)

Est

imat

ed P

(D)

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

Affect in Blogs

Actual P(D)

Est

imat

ed P

(D)

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

Affect in Blogs

Actual P(D)

Est

imat

ed P

(D)

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

Affect in Blogs

Actual P(D)

Est

imat

ed P

(D)

Gary King (Harvard, Gov Dept) The Data Revolution 8 / 24

Reactions to John Kerry’s Botched Joke

You know, education — if you make the most of it . . . you cando well. If you don’t, you get stuck in Iraq.

●●

●● ●

●●

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

−2

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

−1

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

0

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

1

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

2

Gary King (Harvard, Gov Dept) The Data Revolution 9 / 24

Example 2: Computer-Assisted “Reading”

Justin Grimmer and Gary King. 2011. “General-Purpose Clusteringand Conceptualization” Proceedings of the National Academy ofSciences.

Conceptualization through Classification: “one of the most centraland generic of all our conceptual exercises. . . .Without classification,there could be no advanced conceptualization, reasoning, language,data analysis or,for that matter, social science research.” (Bailey,1994).

Cluster Analysis: simultaneously (1) invents categories and (2)assigns documents to categories

Gary King (Harvard, Gov Dept) The Data Revolution 10 / 24

What’s Hard about Clustering?(Why Johnny Can’t Classify)

Goal: Computer-assisted conceptualization & clustering

Bell(n) = number of ways of partitioning n objects

Bell(2) = 2 (AB, A B)

Bell(3) = 5 (ABC, AB C, A BC, AC B, A B C)

Bell(5) = 52

Bell(100) ≈ 1028 × Number of elementary particles in the universe

Now imagine choosing the optimal classification scheme by hand!

Available compromises pursue different goals:

Standard Approach: Fully automated methods no method workswell in general; impossible to know which to apply!Our Approach: Computer-assisted methods You, not somecomputer algorithm, decides what’s important, but with help

Gary King (Harvard, Gov Dept) The Data Revolution 11 / 24

Switch from Fully Automated to Computer Assisted

Computer-Assisted Clustering

Easy in theory: list all clusterings; choose the bestImpossible in practice: Too hard for us mere humans!An organized list will make the search possibleInsight: Many clusterings are perceptually identicalE.g.,: consider two clusterings that differ only because one document(of 10,000) moves from category 5 to 6

Question: How to organize clusterings so humans can understand?

Gary King (Harvard, Gov Dept) The Data Revolution 12 / 24

How to Zoom Out while ReadingYou choose one (or more) clustering, based on insight, discovery, useful information,. . .

Space of Clusterings

biclust_spectral

clust_convex

mult_dirproc

dismea

rock

som

spec_cos spec_eucspec_man

spec_mink

spec_max

spec_canb

mspec_cos

mspec_euc

mspec_man

mspec_mink

mspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euc

kmedoids euclidean

kmedoids manhattan

mixvmf

mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattan

kmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearman

kmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean average

hclust euclidean mcquitty

hclust euclidean median

hclust euclidean centroidhclust maximum ward

hclust maximum single

hclust maximum completehclust maximum averagehclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan averagehclust manhattan mcquittyhclust manhattan median

hclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquittyhclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquitty

hclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson complete

hclust pearson averagehclust pearson mcquitty

hclust pearson medianhclust pearson centroid

hclust correlation ward

hclust correlation single

hclust correlation complete

hclust correlation averagehclust correlation mcquitty

hclust correlation medianhclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman average

hclust spearman mcquitty

hclust spearman median

hclust spearman centroid

hclust kendall ward

hclust kendall single

hclust kendall complete

hclust kendall average

hclust kendall mcquitty

hclust kendall median

hclust kendall centroid

Clustering 1

Carter

Clinton

Eisenhower

Ford

Johnson

Kennedy

Nixon

Obama

Roosevelt

Truman

Bush

HWBush

Reagan

``Other Presidents ''

``Reagan Republicans''

Clustering 2

Carter

Eisenhower

Ford

Johnson

Kennedy

Nixon

Roosevelt

Truman

Bush

Clinton

HWBush

Obama

Reagan

``RooseveltTo Carter''

`` Reagan To Obama ''

Gary King (Harvard, Gov Dept) The Data Revolution 13 / 24

Evaluation: More Informative Discoveries

Found 2 scholars analyzing lots of textual data for their work

Created 6 clusterings:

2 clusterings selected with our method (biased against us)2 clusterings from each of 2 other methods (varying tuning parameters)

Created info packet on each clustering (for each cluster: exemplardocument, automated content summary)

Asked for(62

)=15 pairwise comparisons

User chooses ⇒ only care about the one clustering that wins

Both cases a Condorcet winner:

“Immigration”:

Our Method 1 → vMF 1 → vMF 2 → Our Method 2 → K-Means 1 → K-Means 2

“Genetic testing”:

Our Method 1 → {Our Method 2, K-Means 1, K-means 2} → Dir Proc. 1 → Dir Proc. 2

Gary King (Harvard, Gov Dept) The Data Revolution 14 / 24

Evaluation: What Do Members of Congress Do?

- David Mayhew’s (1974) famous typology

- Advertising- Credit Claiming- Position Taking

- Data: 200 press releases from Frank Lautenberg’s office (D-NJ)

- Apply our method

Gary King (Harvard, Gov Dept) The Data Revolution 15 / 24

Example Discovery

: Partisan Taunting

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

●●

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Clusters in this Clustering

Mayhew

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Clusters in this Clustering

Mayhew

●● ●

●●

●●

●●

●●

Credit ClaimingPork

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Clusters in this Clustering

Mayhew

●● ●

●●

●●

●●

●●

Credit ClaimingPork

● ●●

●●

●●

●●

Credit ClaimingLegislation

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Clusters in this Clustering

Mayhew

●● ●

●●

●●

●●

●●

Credit ClaimingPork

● ●●

●●

●●

●●

Credit ClaimingLegislation

●●

●●

●●

AdvertisingAdvertising

biclust_spectral

clust_convex

mult_dirproc

dismeadist_cosdist_fbinarydist_ebinarydist_minkowskidist_maxdist_canbdist_binary

mec

rocksom

sot_euc

sot_cor

spec_cosspec_eucspec_man

spec_minkspec_maxspec_canbmspec_cosmspec_euc

mspec_manmspec_minkmspec_max

mspec_canb

affprop cosine

affprop euclidean

affprop manhattan

affprop info.costs

affprop maximum

divisive stand.euc

divisive euclidean

divisive manhattan

kmedoids stand.euckmedoids euclidean

kmedoids manhattan

mixvmf mixvmfVA

kmeans euclidean

kmeans maximum

kmeans manhattankmeans canberra

kmeans binary

kmeans pearson

kmeans correlation

kmeans spearmankmeans kendall

hclust euclidean ward

hclust euclidean single

hclust euclidean complete

hclust euclidean averagehclust euclidean mcquitty

hclust euclidean medianhclust euclidean centroid

hclust maximum ward

hclust maximum single

hclust maximum completehclust maximum average

hclust maximum mcquitty

hclust maximum medianhclust maximum centroid

hclust manhattan ward

hclust manhattan single

hclust manhattan complete

hclust manhattan average

hclust manhattan mcquitty

hclust manhattan medianhclust manhattan centroid

hclust canberra ward

hclust canberra single

hclust canberra complete

hclust canberra average

hclust canberra mcquitty

hclust canberra median

hclust canberra centroid

hclust binary ward

hclust binary single

hclust binary complete

hclust binary average

hclust binary mcquittyhclust binary median

hclust binary centroid

hclust pearson ward

hclust pearson single

hclust pearson completehclust pearson average

hclust pearson mcquittyhclust pearson median

hclust pearson centroid

hclust correlation ward

hclust correlation single hclust correlation completehclust correlation average

hclust correlation mcquitty

hclust correlation median

hclust correlation centroid

hclust spearman ward

hclust spearman single

hclust spearman complete

hclust spearman averagehclust spearman mcquitty

hclust spearman medianhclust spearman centroid

hclust kendall ward

hclust kendall singlehclust kendall complete

hclust kendall averagehclust kendall mcquittyhclust kendall medianhclust kendall centroid

Clusters in this Clustering

Mayhew

●● ●

●●

●●

●●

●●

Credit ClaimingPork

● ●●

●●

●●

●●

Credit ClaimingLegislation

●●

●●

●●

AdvertisingAdvertising

●●

●●

●●

● ●

●●

●●

●●

● ●

Partisan Taunting

The space of clusterings Found a regionwith particularly insightful clusteringsLet’s look at one clustering in thisregion Credit Claiming, Pork:“Sens. Frank R. Lautenberg (D-NJ)and Robert Menendez (D-NJ)announced that the U.S. Department ofCommerce has awarded a $100,000grant to the South Jersey EconomicDevelopment District” Credit Claiming,Legislation:“As the Senate begins its recess,Senator Frank Lautenberg todaypointed to a string of victories inCongress on his legislative agendaduring this work period” Advertising:“Senate Adopts Lautenberg/MenendezResolution Honoring Spelling BeeChampion from New Jersey”Partisan Taunting:“Republicans Selling Out Nation onChemical Plant Security” “SenatorLautenberg’s amendment would changethe name of. . . the Republican bill. . . to‘More Tax Breaks for the Rich andMore Debt for Our GrandchildrenDeficit Expansion Reconciliation Act of2006”’ Definition of Partisan Taunting:Explicit, public, and negative attacks onanother political party or its members

Taunting ruinsdeliberation

Gary King (Harvard, Gov Dept) The Data Revolution 16 / 24

In Sample Illustration of Partisan Taunting

Taunting ruins deliberation

Sen. Lautenbergon Senate Floor4/29/04

- “Senator Lautenberg BlastsRepublicans as ‘Chicken Hawks’ ”[Government Oversight]

- “The scopes trial took place in1925. Sadly, President Bush’s vetotoday shows that we haven’tprogressed much since then”[Healthcare]

- “Every day the House Republicansdragged this out was a day thatmade our communities lesssafe.”[Homeland Security]

Gary King (Harvard, Gov Dept) The Data Revolution 17 / 24

Out of Sample Confirmation of Partisan Taunting

- Discovered using 200 press releases; 1 senator.

- Confirmed using 64,033 press releases; 301 senator-years.- Apply supervised learning method: measure proportion of press

releases a senator taunts other party

Prop. of Press Releases Taunting

Fre

quen

cy

0.1 0.2 0.3 0.4 0.5

1020

30

Gary King (Harvard, Gov Dept) The Data Revolution 18 / 24

Out of Sample Confirmation of Partisan Taunting

- Discovered using 200 press releases; 1 senator.- Confirmed using 64,033 press releases; 301 senator-years.- Apply supervised learning method: measure proportion of press

releases a senator taunts other party

Prop. of Press Releases Taunting

Fre

quen

cy

0.1 0.2 0.3 0.4 0.5

1020

30

Prop. of Press Releases Taunting

Fre

quen

cy

0.1 0.2 0.3 0.4 0.5

1020

30

On Avg., Senators Taunt in 27 % of Press Releases

Gary King (Harvard, Gov Dept) The Data Revolution 19 / 24

Normative Implications of Taunting

Partisan taunting:

Very commonMakes deliberation less likelyOccurs more often in homogeneously partisan districts (i.e., whenpreaching to the choir)

Incompatibility of the principles of representative democracy

To get reflection: Homogeneous (noncompetitive) districtsTo get deliberation (no taunting): Heterogeneous (competitive)districts you can’t have both!

Gary King (Harvard, Gov Dept) The Data Revolution 20 / 24

Some New Data Types

1 Unstructured text: emails (1 LOC every 10 minutes), speeches,government reports, blogs, social media updates, web pages,newspapers, scholarly literature

2 Commercial activity: credit cards, sales data, and real estatetransactions, product RFIDs

3 Geographic location: cell phones, Fastlane or EZPass transponders,garage cameras

4 Health information: digital medical records, hospital admittances,google/MS health, and accelerometers and other devices beingincluded in cell phones

5 Biological sciences: effectively becoming social sciences as genomics,proteomics, metabolomics, and brain imaging produce huge numbersof person-level variables.

6 Satellite imagery: increasing in scope, resolution, and availability.7 Electoral activity: ballot images, precinct-level results, individual-level

registration, primary participation, and campaign contributions

Gary King (Harvard, Gov Dept) The Data Revolution 21 / 24

Some More New Data Examples

8 Social media: facebook, twitter, social bookmarking, blog comments,product reviews, virtual worlds, game behavior, crowd sourcing

9 Web surfing artifacts: clicks, searches, and advertising clickthroughs.(Google collects 1 petabyte/72 minutes on human behavior!)

10 Multiplayer web games and virtual worlds: Billions of highlycontrolled experiments on human behavior

11 Government bureaucracies: moving from paper to electronic databases, increasing availability

12 Governmental policies: requiring more data collection, such e.g., “NoChild Left Behind Act”; allowing randomized policy experiments;Obama pushing data distribution

13 Scholarly data: the replication movement in academia, led in part bypolitical science, is massively increasing data sharing

Gary King (Harvard, Gov Dept) The Data Revolution 22 / 24

Enormous Emerging Opportunities for Social Scientists

For the first time: technologies, policies, data, and methods aremaking it feasible to attack some of the most vexing problems thatafflict human society

A massive change from studying problems to understanding andsolving problems

Opportunities require a change in our job descriptions, with new:1 Computer-assisted methods: Traditional quantitative-only or

qualitative-only approaches are infeasible2 Large-scale, interdisciplinary, collaborative research3 New statistical methods & engineering required4 Better theory: to respond to massive new evidence, privacy challenges,

data-driven science

And then there’s you & me:In most legislatures, courts, academic departments, . . . , change comesfrom replacement not conversionWill we wait to be replaced? or put in the effort to convert and learnhow to use the new information?

Gary King (Harvard, Gov Dept) The Data Revolution 23 / 24

For more information

http://GKing.Harvard.edu

Gary King (Harvard, Gov Dept) The Data Revolution 24 / 24