how to read 100 million blogs (& classify deaths without ... · prominent early social...

140
How to Read 100 Million Blogs (& Classify Deaths Without Physicians) Gary King Harvard University April 11, 2007 Gary King Harvard University () How to Read 100 Million Blogs (& Classify Deaths Without Physicians) April 11, 2007 1 / 33

Upload: others

Post on 13-Oct-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

How to Read 100 Million Blogs(& Classify Deaths Without Physicians)

Gary KingHarvard University

April 11, 2007

Gary King Harvard University () How to Read 100 Million Blogs (& Classify Deaths Without Physicians)April 11, 2007 1 / 33

Page 2: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

References

Daniel Hopkins and Gary King. “Extracting Systematic Social ScienceMeaning from Text”

Gary King and Ying Lu. “Verbal Autopsy Methods with MultipleCauses of Death,” tentatively to appear, Statistical Science

Copies at http://gking.harvard.edu

Gary King (Harvard) Content Analysis 2 / 33

Page 3: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

References

Daniel Hopkins and Gary King. “Extracting Systematic Social ScienceMeaning from Text”

Gary King and Ying Lu. “Verbal Autopsy Methods with MultipleCauses of Death,” tentatively to appear, Statistical Science

Copies at http://gking.harvard.edu

Gary King (Harvard) Content Analysis 2 / 33

Page 4: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

References

Daniel Hopkins and Gary King. “Extracting Systematic Social ScienceMeaning from Text”

Gary King and Ying Lu. “Verbal Autopsy Methods with MultipleCauses of Death,” tentatively to appear, Statistical Science

Copies at http://gking.harvard.edu

Gary King (Harvard) Content Analysis 2 / 33

Page 5: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Content Analysis: Past and Future

Dates to the 1600s: The Church tracked nonreligious texts byclassifying newspaper stories

Prominent early social scientists used it: Berelson, de Grazia, etc.

Spread to vast array of fields (use increased six-fold 1980–2000)

New applications: explosive increase in web pages, blogs, emails,digitized books and articles, audio recordings (automaticallyconverted to text), and government reports, legislative hearings andrecords, electronic medical records, etc.

Infeasible to expand hand coding efforts much further

Automated methods are essential

Gary King (Harvard) Content Analysis 3 / 33

Page 6: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Content Analysis: Past and Future

Dates to the 1600s: The Church tracked nonreligious texts byclassifying newspaper stories

Prominent early social scientists used it: Berelson, de Grazia, etc.

Spread to vast array of fields (use increased six-fold 1980–2000)

New applications: explosive increase in web pages, blogs, emails,digitized books and articles, audio recordings (automaticallyconverted to text), and government reports, legislative hearings andrecords, electronic medical records, etc.

Infeasible to expand hand coding efforts much further

Automated methods are essential

Gary King (Harvard) Content Analysis 3 / 33

Page 7: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Content Analysis: Past and Future

Dates to the 1600s: The Church tracked nonreligious texts byclassifying newspaper stories

Prominent early social scientists used it: Berelson, de Grazia, etc.

Spread to vast array of fields (use increased six-fold 1980–2000)

New applications: explosive increase in web pages, blogs, emails,digitized books and articles, audio recordings (automaticallyconverted to text), and government reports, legislative hearings andrecords, electronic medical records, etc.

Infeasible to expand hand coding efforts much further

Automated methods are essential

Gary King (Harvard) Content Analysis 3 / 33

Page 8: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Content Analysis: Past and Future

Dates to the 1600s: The Church tracked nonreligious texts byclassifying newspaper stories

Prominent early social scientists used it: Berelson, de Grazia, etc.

Spread to vast array of fields (use increased six-fold 1980–2000)

New applications: explosive increase in web pages, blogs, emails,digitized books and articles, audio recordings (automaticallyconverted to text), and government reports, legislative hearings andrecords, electronic medical records, etc.

Infeasible to expand hand coding efforts much further

Automated methods are essential

Gary King (Harvard) Content Analysis 3 / 33

Page 9: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Content Analysis: Past and Future

Dates to the 1600s: The Church tracked nonreligious texts byclassifying newspaper stories

Prominent early social scientists used it: Berelson, de Grazia, etc.

Spread to vast array of fields (use increased six-fold 1980–2000)

New applications: explosive increase in web pages, blogs, emails,digitized books and articles, audio recordings (automaticallyconverted to text), and government reports, legislative hearings andrecords, electronic medical records, etc.

Infeasible to expand hand coding efforts much further

Automated methods are essential

Gary King (Harvard) Content Analysis 3 / 33

Page 10: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Content Analysis: Past and Future

Dates to the 1600s: The Church tracked nonreligious texts byclassifying newspaper stories

Prominent early social scientists used it: Berelson, de Grazia, etc.

Spread to vast array of fields (use increased six-fold 1980–2000)

New applications: explosive increase in web pages, blogs, emails,digitized books and articles, audio recordings (automaticallyconverted to text), and government reports, legislative hearings andrecords, electronic medical records, etc.

Infeasible to expand hand coding efforts much further

Automated methods are essential

Gary King (Harvard) Content Analysis 3 / 33

Page 11: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Content Analysis: Past and Future

Dates to the 1600s: The Church tracked nonreligious texts byclassifying newspaper stories

Prominent early social scientists used it: Berelson, de Grazia, etc.

Spread to vast array of fields (use increased six-fold 1980–2000)

New applications: explosive increase in web pages, blogs, emails,digitized books and articles, audio recordings (automaticallyconverted to text), and government reports, legislative hearings andrecords, electronic medical records, etc.

Infeasible to expand hand coding efforts much further

Automated methods are essential

Gary King (Harvard) Content Analysis 3 / 33

Page 12: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Inputs and Target Quantities of Interest

Available inputs:

Large set of text documentsA set of (mutually exclusive and exhaustive) categoriesA small subset of documents hand-coded into the categories

Quantities of interest

individual document classificationproportion of documents in each categoryCan get the 2nd by aggregating the 1st (turns out not to be necessary!)E.g., classify constituents’ letters to a member of congress by policyarea, or estimate proportion of letters in each policy areaE.g., classify emails as spam or not, or estimate proportion of emailthat is spam

Maximizing one goal won’t get you the other: high classificationaccuracy can coexist with huge biases in category proportions

Gary King (Harvard) Content Analysis 4 / 33

Page 13: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Inputs and Target Quantities of Interest

Available inputs:

Large set of text documentsA set of (mutually exclusive and exhaustive) categoriesA small subset of documents hand-coded into the categories

Quantities of interest

individual document classificationproportion of documents in each categoryCan get the 2nd by aggregating the 1st (turns out not to be necessary!)E.g., classify constituents’ letters to a member of congress by policyarea, or estimate proportion of letters in each policy areaE.g., classify emails as spam or not, or estimate proportion of emailthat is spam

Maximizing one goal won’t get you the other: high classificationaccuracy can coexist with huge biases in category proportions

Gary King (Harvard) Content Analysis 4 / 33

Page 14: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Inputs and Target Quantities of Interest

Available inputs:

Large set of text documents

A set of (mutually exclusive and exhaustive) categoriesA small subset of documents hand-coded into the categories

Quantities of interest

individual document classificationproportion of documents in each categoryCan get the 2nd by aggregating the 1st (turns out not to be necessary!)E.g., classify constituents’ letters to a member of congress by policyarea, or estimate proportion of letters in each policy areaE.g., classify emails as spam or not, or estimate proportion of emailthat is spam

Maximizing one goal won’t get you the other: high classificationaccuracy can coexist with huge biases in category proportions

Gary King (Harvard) Content Analysis 4 / 33

Page 15: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Inputs and Target Quantities of Interest

Available inputs:

Large set of text documentsA set of (mutually exclusive and exhaustive) categories

A small subset of documents hand-coded into the categories

Quantities of interest

individual document classificationproportion of documents in each categoryCan get the 2nd by aggregating the 1st (turns out not to be necessary!)E.g., classify constituents’ letters to a member of congress by policyarea, or estimate proportion of letters in each policy areaE.g., classify emails as spam or not, or estimate proportion of emailthat is spam

Maximizing one goal won’t get you the other: high classificationaccuracy can coexist with huge biases in category proportions

Gary King (Harvard) Content Analysis 4 / 33

Page 16: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Inputs and Target Quantities of Interest

Available inputs:

Large set of text documentsA set of (mutually exclusive and exhaustive) categoriesA small subset of documents hand-coded into the categories

Quantities of interest

individual document classificationproportion of documents in each categoryCan get the 2nd by aggregating the 1st (turns out not to be necessary!)E.g., classify constituents’ letters to a member of congress by policyarea, or estimate proportion of letters in each policy areaE.g., classify emails as spam or not, or estimate proportion of emailthat is spam

Maximizing one goal won’t get you the other: high classificationaccuracy can coexist with huge biases in category proportions

Gary King (Harvard) Content Analysis 4 / 33

Page 17: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Inputs and Target Quantities of Interest

Available inputs:

Large set of text documentsA set of (mutually exclusive and exhaustive) categoriesA small subset of documents hand-coded into the categories

Quantities of interest

individual document classificationproportion of documents in each categoryCan get the 2nd by aggregating the 1st (turns out not to be necessary!)E.g., classify constituents’ letters to a member of congress by policyarea, or estimate proportion of letters in each policy areaE.g., classify emails as spam or not, or estimate proportion of emailthat is spam

Maximizing one goal won’t get you the other: high classificationaccuracy can coexist with huge biases in category proportions

Gary King (Harvard) Content Analysis 4 / 33

Page 18: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Inputs and Target Quantities of Interest

Available inputs:

Large set of text documentsA set of (mutually exclusive and exhaustive) categoriesA small subset of documents hand-coded into the categories

Quantities of interest

individual document classification

proportion of documents in each categoryCan get the 2nd by aggregating the 1st (turns out not to be necessary!)E.g., classify constituents’ letters to a member of congress by policyarea, or estimate proportion of letters in each policy areaE.g., classify emails as spam or not, or estimate proportion of emailthat is spam

Maximizing one goal won’t get you the other: high classificationaccuracy can coexist with huge biases in category proportions

Gary King (Harvard) Content Analysis 4 / 33

Page 19: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Inputs and Target Quantities of Interest

Available inputs:

Large set of text documentsA set of (mutually exclusive and exhaustive) categoriesA small subset of documents hand-coded into the categories

Quantities of interest

individual document classificationproportion of documents in each category

Can get the 2nd by aggregating the 1st (turns out not to be necessary!)E.g., classify constituents’ letters to a member of congress by policyarea, or estimate proportion of letters in each policy areaE.g., classify emails as spam or not, or estimate proportion of emailthat is spam

Maximizing one goal won’t get you the other: high classificationaccuracy can coexist with huge biases in category proportions

Gary King (Harvard) Content Analysis 4 / 33

Page 20: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Inputs and Target Quantities of Interest

Available inputs:

Large set of text documentsA set of (mutually exclusive and exhaustive) categoriesA small subset of documents hand-coded into the categories

Quantities of interest

individual document classificationproportion of documents in each categoryCan get the 2nd by aggregating the 1st (turns out not to be necessary!)

E.g., classify constituents’ letters to a member of congress by policyarea, or estimate proportion of letters in each policy areaE.g., classify emails as spam or not, or estimate proportion of emailthat is spam

Maximizing one goal won’t get you the other: high classificationaccuracy can coexist with huge biases in category proportions

Gary King (Harvard) Content Analysis 4 / 33

Page 21: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Inputs and Target Quantities of Interest

Available inputs:

Large set of text documentsA set of (mutually exclusive and exhaustive) categoriesA small subset of documents hand-coded into the categories

Quantities of interest

individual document classificationproportion of documents in each categoryCan get the 2nd by aggregating the 1st (turns out not to be necessary!)E.g., classify constituents’ letters to a member of congress by policyarea, or estimate proportion of letters in each policy area

E.g., classify emails as spam or not, or estimate proportion of emailthat is spam

Maximizing one goal won’t get you the other: high classificationaccuracy can coexist with huge biases in category proportions

Gary King (Harvard) Content Analysis 4 / 33

Page 22: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Inputs and Target Quantities of Interest

Available inputs:

Large set of text documentsA set of (mutually exclusive and exhaustive) categoriesA small subset of documents hand-coded into the categories

Quantities of interest

individual document classificationproportion of documents in each categoryCan get the 2nd by aggregating the 1st (turns out not to be necessary!)E.g., classify constituents’ letters to a member of congress by policyarea, or estimate proportion of letters in each policy areaE.g., classify emails as spam or not, or estimate proportion of emailthat is spam

Maximizing one goal won’t get you the other: high classificationaccuracy can coexist with huge biases in category proportions

Gary King (Harvard) Content Analysis 4 / 33

Page 23: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Inputs and Target Quantities of Interest

Available inputs:

Large set of text documentsA set of (mutually exclusive and exhaustive) categoriesA small subset of documents hand-coded into the categories

Quantities of interest

individual document classificationproportion of documents in each categoryCan get the 2nd by aggregating the 1st (turns out not to be necessary!)E.g., classify constituents’ letters to a member of congress by policyarea, or estimate proportion of letters in each policy areaE.g., classify emails as spam or not, or estimate proportion of emailthat is spam

Maximizing one goal won’t get you the other: high classificationaccuracy can coexist with huge biases in category proportions

Gary King (Harvard) Content Analysis 4 / 33

Page 24: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Our Approach

Gives unbiased estimates of population proportions

Works better than aggregating the best classification method

No problem if classification accuracy is low

(And individual classification is not necessary)

No parametric modeling assumptions

The hand coded subset need not be a random sample

Scales to large numbers of documents

Separately: propose correction for imperfect inter-coder reliability(i.e., should work better than hand coding everything if that werefeasible)

Gary King (Harvard) Content Analysis 5 / 33

Page 25: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Our Approach

Gives unbiased estimates of population proportions

Works better than aggregating the best classification method

No problem if classification accuracy is low

(And individual classification is not necessary)

No parametric modeling assumptions

The hand coded subset need not be a random sample

Scales to large numbers of documents

Separately: propose correction for imperfect inter-coder reliability(i.e., should work better than hand coding everything if that werefeasible)

Gary King (Harvard) Content Analysis 5 / 33

Page 26: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Our Approach

Gives unbiased estimates of population proportions

Works better than aggregating the best classification method

No problem if classification accuracy is low

(And individual classification is not necessary)

No parametric modeling assumptions

The hand coded subset need not be a random sample

Scales to large numbers of documents

Separately: propose correction for imperfect inter-coder reliability(i.e., should work better than hand coding everything if that werefeasible)

Gary King (Harvard) Content Analysis 5 / 33

Page 27: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Our Approach

Gives unbiased estimates of population proportions

Works better than aggregating the best classification method

No problem if classification accuracy is low

(And individual classification is not necessary)

No parametric modeling assumptions

The hand coded subset need not be a random sample

Scales to large numbers of documents

Separately: propose correction for imperfect inter-coder reliability(i.e., should work better than hand coding everything if that werefeasible)

Gary King (Harvard) Content Analysis 5 / 33

Page 28: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Our Approach

Gives unbiased estimates of population proportions

Works better than aggregating the best classification method

No problem if classification accuracy is low

(And individual classification is not necessary)

No parametric modeling assumptions

The hand coded subset need not be a random sample

Scales to large numbers of documents

Separately: propose correction for imperfect inter-coder reliability(i.e., should work better than hand coding everything if that werefeasible)

Gary King (Harvard) Content Analysis 5 / 33

Page 29: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Our Approach

Gives unbiased estimates of population proportions

Works better than aggregating the best classification method

No problem if classification accuracy is low

(And individual classification is not necessary)

No parametric modeling assumptions

The hand coded subset need not be a random sample

Scales to large numbers of documents

Separately: propose correction for imperfect inter-coder reliability(i.e., should work better than hand coding everything if that werefeasible)

Gary King (Harvard) Content Analysis 5 / 33

Page 30: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Our Approach

Gives unbiased estimates of population proportions

Works better than aggregating the best classification method

No problem if classification accuracy is low

(And individual classification is not necessary)

No parametric modeling assumptions

The hand coded subset need not be a random sample

Scales to large numbers of documents

Separately: propose correction for imperfect inter-coder reliability(i.e., should work better than hand coding everything if that werefeasible)

Gary King (Harvard) Content Analysis 5 / 33

Page 31: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Our Approach

Gives unbiased estimates of population proportions

Works better than aggregating the best classification method

No problem if classification accuracy is low

(And individual classification is not necessary)

No parametric modeling assumptions

The hand coded subset need not be a random sample

Scales to large numbers of documents

Separately: propose correction for imperfect inter-coder reliability(i.e., should work better than hand coding everything if that werefeasible)

Gary King (Harvard) Content Analysis 5 / 33

Page 32: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Our Approach

Gives unbiased estimates of population proportions

Works better than aggregating the best classification method

No problem if classification accuracy is low

(And individual classification is not necessary)

No parametric modeling assumptions

The hand coded subset need not be a random sample

Scales to large numbers of documents

Separately: propose correction for imperfect inter-coder reliability(i.e., should work better than hand coding everything if that werefeasible)

Gary King (Harvard) Content Analysis 5 / 33

Page 33: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Blogs as a Running Example

Blogs (web logs): web version of a daily diary, with posts listed inreverse chronological order.

8% of U.S. Internet users (12 million) have blogs

Growth: ≈ 0 in 2000 to 39–100 million worldwide now.

A democratic technology: 6 million in China and 700,000 in Iran(!)

“We are living through the largest expansion of expressive capabilityin the history of the human race”

Gary King (Harvard) Content Analysis 6 / 33

Page 34: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Blogs as a Running Example

Blogs (web logs): web version of a daily diary, with posts listed inreverse chronological order.

8% of U.S. Internet users (12 million) have blogs

Growth: ≈ 0 in 2000 to 39–100 million worldwide now.

A democratic technology: 6 million in China and 700,000 in Iran(!)

“We are living through the largest expansion of expressive capabilityin the history of the human race”

Gary King (Harvard) Content Analysis 6 / 33

Page 35: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Blogs as a Running Example

Blogs (web logs): web version of a daily diary, with posts listed inreverse chronological order.

8% of U.S. Internet users (12 million) have blogs

Growth: ≈ 0 in 2000 to 39–100 million worldwide now.

A democratic technology: 6 million in China and 700,000 in Iran(!)

“We are living through the largest expansion of expressive capabilityin the history of the human race”

Gary King (Harvard) Content Analysis 6 / 33

Page 36: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Blogs as a Running Example

Blogs (web logs): web version of a daily diary, with posts listed inreverse chronological order.

8% of U.S. Internet users (12 million) have blogs

Growth: ≈ 0 in 2000 to 39–100 million worldwide now.

A democratic technology: 6 million in China and 700,000 in Iran(!)

“We are living through the largest expansion of expressive capabilityin the history of the human race”

Gary King (Harvard) Content Analysis 6 / 33

Page 37: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Blogs as a Running Example

Blogs (web logs): web version of a daily diary, with posts listed inreverse chronological order.

8% of U.S. Internet users (12 million) have blogs

Growth: ≈ 0 in 2000 to 39–100 million worldwide now.

A democratic technology: 6 million in China and 700,000 in Iran(!)

“We are living through the largest expansion of expressive capabilityin the history of the human race”

Gary King (Harvard) Content Analysis 6 / 33

Page 38: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Blogs as a Running Example

Blogs (web logs): web version of a daily diary, with posts listed inreverse chronological order.

8% of U.S. Internet users (12 million) have blogs

Growth: ≈ 0 in 2000 to 39–100 million worldwide now.

A democratic technology: 6 million in China and 700,000 in Iran(!)

“We are living through the largest expansion of expressive capabilityin the history of the human race”

Gary King (Harvard) Content Analysis 6 / 33

Page 39: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

One specific quantity of interest

Subject: the grand conversation about the American presidency

Question: affect about President Bush and 2008 candidates

Specific categories: Label Category−2 extremely negative−1 negative

0 neutral1 positive2 extremely positive

NA no opinion expressedNB not a blog

Hard case:

Part ordinal, part nominal categorization“Sentiment categorization is more difficult than topic classification”Language ranges from “my crunchy gf thinks dubya hid the wmd’s, :)!’to the Queen’s EnglishLittle common internal structure (no inverted pyramid)

Gary King (Harvard) Content Analysis 7 / 33

Page 40: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

One specific quantity of interest

Subject: the grand conversation about the American presidency

Question: affect about President Bush and 2008 candidates

Specific categories: Label Category−2 extremely negative−1 negative

0 neutral1 positive2 extremely positive

NA no opinion expressedNB not a blog

Hard case:

Part ordinal, part nominal categorization“Sentiment categorization is more difficult than topic classification”Language ranges from “my crunchy gf thinks dubya hid the wmd’s, :)!’to the Queen’s EnglishLittle common internal structure (no inverted pyramid)

Gary King (Harvard) Content Analysis 7 / 33

Page 41: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

One specific quantity of interest

Subject: the grand conversation about the American presidency

Question: affect about President Bush and 2008 candidates

Specific categories: Label Category−2 extremely negative−1 negative

0 neutral1 positive2 extremely positive

NA no opinion expressedNB not a blog

Hard case:

Part ordinal, part nominal categorization“Sentiment categorization is more difficult than topic classification”Language ranges from “my crunchy gf thinks dubya hid the wmd’s, :)!’to the Queen’s EnglishLittle common internal structure (no inverted pyramid)

Gary King (Harvard) Content Analysis 7 / 33

Page 42: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

One specific quantity of interest

Subject: the grand conversation about the American presidency

Question: affect about President Bush and 2008 candidates

Specific categories: Label Category−2 extremely negative−1 negative

0 neutral1 positive2 extremely positive

NA no opinion expressedNB not a blog

Hard case:

Part ordinal, part nominal categorization“Sentiment categorization is more difficult than topic classification”Language ranges from “my crunchy gf thinks dubya hid the wmd’s, :)!’to the Queen’s EnglishLittle common internal structure (no inverted pyramid)

Gary King (Harvard) Content Analysis 7 / 33

Page 43: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

One specific quantity of interest

Subject: the grand conversation about the American presidency

Question: affect about President Bush and 2008 candidates

Specific categories: Label Category−2 extremely negative−1 negative

0 neutral1 positive2 extremely positive

NA no opinion expressedNB not a blog

Hard case:

Part ordinal, part nominal categorization“Sentiment categorization is more difficult than topic classification”Language ranges from “my crunchy gf thinks dubya hid the wmd’s, :)!’to the Queen’s EnglishLittle common internal structure (no inverted pyramid)

Gary King (Harvard) Content Analysis 7 / 33

Page 44: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

One specific quantity of interest

Subject: the grand conversation about the American presidency

Question: affect about President Bush and 2008 candidates

Specific categories: Label Category−2 extremely negative−1 negative

0 neutral1 positive2 extremely positive

NA no opinion expressedNB not a blog

Hard case:

Part ordinal, part nominal categorization

“Sentiment categorization is more difficult than topic classification”Language ranges from “my crunchy gf thinks dubya hid the wmd’s, :)!’to the Queen’s EnglishLittle common internal structure (no inverted pyramid)

Gary King (Harvard) Content Analysis 7 / 33

Page 45: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

One specific quantity of interest

Subject: the grand conversation about the American presidency

Question: affect about President Bush and 2008 candidates

Specific categories: Label Category−2 extremely negative−1 negative

0 neutral1 positive2 extremely positive

NA no opinion expressedNB not a blog

Hard case:

Part ordinal, part nominal categorization“Sentiment categorization is more difficult than topic classification”

Language ranges from “my crunchy gf thinks dubya hid the wmd’s, :)!’to the Queen’s EnglishLittle common internal structure (no inverted pyramid)

Gary King (Harvard) Content Analysis 7 / 33

Page 46: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

One specific quantity of interest

Subject: the grand conversation about the American presidency

Question: affect about President Bush and 2008 candidates

Specific categories: Label Category−2 extremely negative−1 negative

0 neutral1 positive2 extremely positive

NA no opinion expressedNB not a blog

Hard case:

Part ordinal, part nominal categorization“Sentiment categorization is more difficult than topic classification”Language ranges from “my crunchy gf thinks dubya hid the wmd’s, :)!’to the Queen’s English

Little common internal structure (no inverted pyramid)

Gary King (Harvard) Content Analysis 7 / 33

Page 47: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

One specific quantity of interest

Subject: the grand conversation about the American presidency

Question: affect about President Bush and 2008 candidates

Specific categories: Label Category−2 extremely negative−1 negative

0 neutral1 positive2 extremely positive

NA no opinion expressedNB not a blog

Hard case:

Part ordinal, part nominal categorization“Sentiment categorization is more difficult than topic classification”Language ranges from “my crunchy gf thinks dubya hid the wmd’s, :)!’to the Queen’s EnglishLittle common internal structure (no inverted pyramid)

Gary King (Harvard) Content Analysis 7 / 33

Page 48: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

The Conversation about John Kerry’s Botched Joke

You know, education — if you make the most of it . . . you cando well. If you don’t, you get stuck in Iraq.

●●

●● ●

●●

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

−2

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

−1

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

0

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

1

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

2

Gary King (Harvard) Content Analysis 8 / 33

Page 49: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

The Conversation about John Kerry’s Botched Joke

You know, education — if you make the most of it . . . you cando well. If you don’t, you get stuck in Iraq.

●●

●● ●

●●

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

−2

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

−1

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

0

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

1

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

2

Gary King (Harvard) Content Analysis 8 / 33

Page 50: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

The Conversation about John Kerry’s Botched Joke

You know, education — if you make the most of it . . . you cando well. If you don’t, you get stuck in Iraq.

●●

●● ●

●●

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

−2

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

−1

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

0

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

1

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Affect Towards John Kerry

2006−2007

Pro

port

ion

Sept Oct Nov Dec Jan Feb Mar

2

Gary King (Harvard) Content Analysis 8 / 33

Page 51: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Representing Text as Numbers

Filter: choose English language blogs that mention Bush (“Bush”,“George W.”, “Dubya”, “King George”, etc.), Hillary Clinton(“Senator Clinton”, “Hillary”, “Hitlery”, “Mrs. Clinton”), etc.

Preprocess: convert to lower case, remove punctuation, performstemming (reduce “consist”, “consisted”, “consistency”, “consistent”,“consistently”, “consisting”, and “consists”, to their stem: “consist”)

Code variables as presence or absence of unique unigrams, bigrams,trigrams, etc.

Example:

Our 10,771 blog posts about Bush and Clinton:201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.Unigrams in > 1% or < 99% of documents: 3,672 variablesGroups infinite possible posts into “only” 23,672 distinct types

Gary King (Harvard) Content Analysis 9 / 33

Page 52: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Representing Text as Numbers

Filter: choose English language blogs that mention Bush (“Bush”,“George W.”, “Dubya”, “King George”, etc.), Hillary Clinton(“Senator Clinton”, “Hillary”, “Hitlery”, “Mrs. Clinton”), etc.

Preprocess: convert to lower case, remove punctuation, performstemming (reduce “consist”, “consisted”, “consistency”, “consistent”,“consistently”, “consisting”, and “consists”, to their stem: “consist”)

Code variables as presence or absence of unique unigrams, bigrams,trigrams, etc.

Example:

Our 10,771 blog posts about Bush and Clinton:201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.Unigrams in > 1% or < 99% of documents: 3,672 variablesGroups infinite possible posts into “only” 23,672 distinct types

Gary King (Harvard) Content Analysis 9 / 33

Page 53: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Representing Text as Numbers

Filter: choose English language blogs that mention Bush (“Bush”,“George W.”, “Dubya”, “King George”, etc.), Hillary Clinton(“Senator Clinton”, “Hillary”, “Hitlery”, “Mrs. Clinton”), etc.

Preprocess: convert to lower case, remove punctuation, performstemming (reduce “consist”, “consisted”, “consistency”, “consistent”,“consistently”, “consisting”, and “consists”, to their stem: “consist”)

Code variables as presence or absence of unique unigrams, bigrams,trigrams, etc.

Example:

Our 10,771 blog posts about Bush and Clinton:201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.Unigrams in > 1% or < 99% of documents: 3,672 variablesGroups infinite possible posts into “only” 23,672 distinct types

Gary King (Harvard) Content Analysis 9 / 33

Page 54: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Representing Text as Numbers

Filter: choose English language blogs that mention Bush (“Bush”,“George W.”, “Dubya”, “King George”, etc.), Hillary Clinton(“Senator Clinton”, “Hillary”, “Hitlery”, “Mrs. Clinton”), etc.

Preprocess: convert to lower case, remove punctuation, performstemming (reduce “consist”, “consisted”, “consistency”, “consistent”,“consistently”, “consisting”, and “consists”, to their stem: “consist”)

Code variables as presence or absence of unique unigrams, bigrams,trigrams, etc.

Example:

Our 10,771 blog posts about Bush and Clinton:201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.Unigrams in > 1% or < 99% of documents: 3,672 variablesGroups infinite possible posts into “only” 23,672 distinct types

Gary King (Harvard) Content Analysis 9 / 33

Page 55: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Representing Text as Numbers

Filter: choose English language blogs that mention Bush (“Bush”,“George W.”, “Dubya”, “King George”, etc.), Hillary Clinton(“Senator Clinton”, “Hillary”, “Hitlery”, “Mrs. Clinton”), etc.

Preprocess: convert to lower case, remove punctuation, performstemming (reduce “consist”, “consisted”, “consistency”, “consistent”,“consistently”, “consisting”, and “consists”, to their stem: “consist”)

Code variables as presence or absence of unique unigrams, bigrams,trigrams, etc.

Example:

Our 10,771 blog posts about Bush and Clinton:201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.Unigrams in > 1% or < 99% of documents: 3,672 variablesGroups infinite possible posts into “only” 23,672 distinct types

Gary King (Harvard) Content Analysis 9 / 33

Page 56: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Representing Text as Numbers

Filter: choose English language blogs that mention Bush (“Bush”,“George W.”, “Dubya”, “King George”, etc.), Hillary Clinton(“Senator Clinton”, “Hillary”, “Hitlery”, “Mrs. Clinton”), etc.

Preprocess: convert to lower case, remove punctuation, performstemming (reduce “consist”, “consisted”, “consistency”, “consistent”,“consistently”, “consisting”, and “consists”, to their stem: “consist”)

Code variables as presence or absence of unique unigrams, bigrams,trigrams, etc.

Example:

Our 10,771 blog posts about Bush and Clinton:201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.

Unigrams in > 1% or < 99% of documents: 3,672 variablesGroups infinite possible posts into “only” 23,672 distinct types

Gary King (Harvard) Content Analysis 9 / 33

Page 57: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Representing Text as Numbers

Filter: choose English language blogs that mention Bush (“Bush”,“George W.”, “Dubya”, “King George”, etc.), Hillary Clinton(“Senator Clinton”, “Hillary”, “Hitlery”, “Mrs. Clinton”), etc.

Preprocess: convert to lower case, remove punctuation, performstemming (reduce “consist”, “consisted”, “consistency”, “consistent”,“consistently”, “consisting”, and “consists”, to their stem: “consist”)

Code variables as presence or absence of unique unigrams, bigrams,trigrams, etc.

Example:

Our 10,771 blog posts about Bush and Clinton:201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.Unigrams in > 1% or < 99% of documents: 3,672 variables

Groups infinite possible posts into “only” 23,672 distinct types

Gary King (Harvard) Content Analysis 9 / 33

Page 58: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Representing Text as Numbers

Filter: choose English language blogs that mention Bush (“Bush”,“George W.”, “Dubya”, “King George”, etc.), Hillary Clinton(“Senator Clinton”, “Hillary”, “Hitlery”, “Mrs. Clinton”), etc.

Preprocess: convert to lower case, remove punctuation, performstemming (reduce “consist”, “consisted”, “consistency”, “consistent”,“consistently”, “consisting”, and “consists”, to their stem: “consist”)

Code variables as presence or absence of unique unigrams, bigrams,trigrams, etc.

Example:

Our 10,771 blog posts about Bush and Clinton:201,676 unigrams, 2,392,027 bigrams, 5,761,979 trigrams.Unigrams in > 1% or < 99% of documents: 3,672 variablesGroups infinite possible posts into “only” 23,672 distinct types

Gary King (Harvard) Content Analysis 9 / 33

Page 59: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Notation

Document Category

Di =

-2 extremely negative

-1 negative

0 neutral

1 positive

2 extremely positive

NA no opinion expressed

NB not a blog

Word Stem Profile:

Si =

Si1 = 1 if “awful” is used, 0 if not

Si2 = 1 if “good” is used, 0 if not...

...

SiK = 1 if “except” is used, 0 if not

Gary King (Harvard) Content Analysis 10 / 33

Page 60: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Notation

Document Category

Di =

-2 extremely negative

-1 negative

0 neutral

1 positive

2 extremely positive

NA no opinion expressed

NB not a blog

Word Stem Profile:

Si =

Si1 = 1 if “awful” is used, 0 if not

Si2 = 1 if “good” is used, 0 if not...

...

SiK = 1 if “except” is used, 0 if not

Gary King (Harvard) Content Analysis 10 / 33

Page 61: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Notation

Document Category

Di =

-2 extremely negative

-1 negative

0 neutral

1 positive

2 extremely positive

NA no opinion expressed

NB not a blog

Word Stem Profile:

Si =

Si1 = 1 if “awful” is used, 0 if not

Si2 = 1 if “good” is used, 0 if not...

...

SiK = 1 if “except” is used, 0 if not

Gary King (Harvard) Content Analysis 10 / 33

Page 62: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Quantities of Interest

Computer Science: individual document classifications

D1,D2 . . . , DL

Social Science: proportions in each category

P(D) =

P(D = −2)P(D = −1)P(D = 0)P(D = 1)P(D = 2)

P(D = NA)P(D = NB)

Gary King (Harvard) Content Analysis 11 / 33

Page 63: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Quantities of Interest

Computer Science: individual document classifications

D1,D2 . . . , DL

Social Science: proportions in each category

P(D) =

P(D = −2)P(D = −1)P(D = 0)P(D = 1)P(D = 2)

P(D = NA)P(D = NB)

Gary King (Harvard) Content Analysis 11 / 33

Page 64: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Quantities of Interest

Computer Science: individual document classifications

D1,D2 . . . , DL

Social Science: proportions in each category

P(D) =

P(D = −2)P(D = −1)P(D = 0)P(D = 1)P(D = 2)

P(D = NA)P(D = NB)

Gary King (Harvard) Content Analysis 11 / 33

Page 65: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Issues with Existing Statistical Approaches

1 Direct Sampling

Classification of population documents not necessaryBiased without a random samplenonrandomness common due to population drift, studying datasubdivisions, etc.

2 Aggregation of model-based individual classifications

Biased if not random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard) Content Analysis 12 / 33

Page 66: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Issues with Existing Statistical Approaches

1 Direct Sampling

Classification of population documents not necessaryBiased without a random samplenonrandomness common due to population drift, studying datasubdivisions, etc.

2 Aggregation of model-based individual classifications

Biased if not random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard) Content Analysis 12 / 33

Page 67: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Issues with Existing Statistical Approaches

1 Direct Sampling

Classification of population documents not necessary

Biased without a random samplenonrandomness common due to population drift, studying datasubdivisions, etc.

2 Aggregation of model-based individual classifications

Biased if not random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard) Content Analysis 12 / 33

Page 68: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Issues with Existing Statistical Approaches

1 Direct Sampling

Classification of population documents not necessaryBiased without a random sample

nonrandomness common due to population drift, studying datasubdivisions, etc.

2 Aggregation of model-based individual classifications

Biased if not random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard) Content Analysis 12 / 33

Page 69: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Issues with Existing Statistical Approaches

1 Direct Sampling

Classification of population documents not necessaryBiased without a random samplenonrandomness common due to population drift, studying datasubdivisions, etc.

2 Aggregation of model-based individual classifications

Biased if not random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard) Content Analysis 12 / 33

Page 70: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Issues with Existing Statistical Approaches

1 Direct Sampling

Classification of population documents not necessaryBiased without a random samplenonrandomness common due to population drift, studying datasubdivisions, etc.

2 Aggregation of model-based individual classifications

Biased if not random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard) Content Analysis 12 / 33

Page 71: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Issues with Existing Statistical Approaches

1 Direct Sampling

Classification of population documents not necessaryBiased without a random samplenonrandomness common due to population drift, studying datasubdivisions, etc.

2 Aggregation of model-based individual classifications

Biased if not random sample

Models P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard) Content Analysis 12 / 33

Page 72: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Issues with Existing Statistical Approaches

1 Direct Sampling

Classification of population documents not necessaryBiased without a random samplenonrandomness common due to population drift, studying datasubdivisions, etc.

2 Aggregation of model-based individual classifications

Biased if not random sampleModels P(D|S), but the world works as P(S|D)

Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard) Content Analysis 12 / 33

Page 73: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Issues with Existing Statistical Approaches

1 Direct Sampling

Classification of population documents not necessaryBiased without a random samplenonrandomness common due to population drift, studying datasubdivisions, etc.

2 Aggregation of model-based individual classifications

Biased if not random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard) Content Analysis 12 / 33

Page 74: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Issues with Existing Statistical Approaches

1 Direct Sampling

Classification of population documents not necessaryBiased without a random samplenonrandomness common due to population drift, studying datasubdivisions, etc.

2 Aggregation of model-based individual classifications

Biased if not random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.

S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard) Content Analysis 12 / 33

Page 75: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Issues with Existing Statistical Approaches

1 Direct Sampling

Classification of population documents not necessaryBiased without a random samplenonrandomness common due to population drift, studying datasubdivisions, etc.

2 Aggregation of model-based individual classifications

Biased if not random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard) Content Analysis 12 / 33

Page 76: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Issues with Existing Statistical Approaches

1 Direct Sampling

Classification of population documents not necessaryBiased without a random samplenonrandomness common due to population drift, studying datasubdivisions, etc.

2 Aggregation of model-based individual classifications

Biased if not random sampleModels P(D|S), but the world works as P(S|D)Bias unless

P(D|S) encompasses the “true” model.S spans the space of all predictors of D (i.e., all information in thedocument)

Bias even with optimal classification and high % correctly classified

Gary King (Harvard) Content Analysis 12 / 33

Page 77: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Using Misclassification Rates to Correct Proportions

Use some method to classify unlabeled documents

Use labeled set to estimate misclassification rates (by cross-validation)

Aggregate classifications to category proportions

Use misclassification rates to correct proportions

Result: vastly improved estimates of category proportions

(Assumes misclassification rates are estimated well)

(still requires individual classification)

Gary King (Harvard) Content Analysis 13 / 33

Page 78: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Using Misclassification Rates to Correct Proportions

Use some method to classify unlabeled documents

Use labeled set to estimate misclassification rates (by cross-validation)

Aggregate classifications to category proportions

Use misclassification rates to correct proportions

Result: vastly improved estimates of category proportions

(Assumes misclassification rates are estimated well)

(still requires individual classification)

Gary King (Harvard) Content Analysis 13 / 33

Page 79: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Using Misclassification Rates to Correct Proportions

Use some method to classify unlabeled documents

Use labeled set to estimate misclassification rates (by cross-validation)

Aggregate classifications to category proportions

Use misclassification rates to correct proportions

Result: vastly improved estimates of category proportions

(Assumes misclassification rates are estimated well)

(still requires individual classification)

Gary King (Harvard) Content Analysis 13 / 33

Page 80: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Using Misclassification Rates to Correct Proportions

Use some method to classify unlabeled documents

Use labeled set to estimate misclassification rates (by cross-validation)

Aggregate classifications to category proportions

Use misclassification rates to correct proportions

Result: vastly improved estimates of category proportions

(Assumes misclassification rates are estimated well)

(still requires individual classification)

Gary King (Harvard) Content Analysis 13 / 33

Page 81: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Using Misclassification Rates to Correct Proportions

Use some method to classify unlabeled documents

Use labeled set to estimate misclassification rates (by cross-validation)

Aggregate classifications to category proportions

Use misclassification rates to correct proportions

Result: vastly improved estimates of category proportions

(Assumes misclassification rates are estimated well)

(still requires individual classification)

Gary King (Harvard) Content Analysis 13 / 33

Page 82: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Using Misclassification Rates to Correct Proportions

Use some method to classify unlabeled documents

Use labeled set to estimate misclassification rates (by cross-validation)

Aggregate classifications to category proportions

Use misclassification rates to correct proportions

Result: vastly improved estimates of category proportions

(Assumes misclassification rates are estimated well)

(still requires individual classification)

Gary King (Harvard) Content Analysis 13 / 33

Page 83: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Using Misclassification Rates to Correct Proportions

Use some method to classify unlabeled documents

Use labeled set to estimate misclassification rates (by cross-validation)

Aggregate classifications to category proportions

Use misclassification rates to correct proportions

Result: vastly improved estimates of category proportions

(Assumes misclassification rates are estimated well)

(still requires individual classification)

Gary King (Harvard) Content Analysis 13 / 33

Page 84: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Using Misclassification Rates to Correct Proportions

Use some method to classify unlabeled documents

Use labeled set to estimate misclassification rates (by cross-validation)

Aggregate classifications to category proportions

Use misclassification rates to correct proportions

Result: vastly improved estimates of category proportions

(Assumes misclassification rates are estimated well)

(still requires individual classification)

Gary King (Harvard) Content Analysis 13 / 33

Page 85: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Formalization from Epidemiology(Levy and Kass, 1970)

Accounting identity for 2 categories:

P(D̂ = 1) = (sens)P(D = 1) + (1− spec)P(D = 2)

Solve:

P(D = 1) =P(D̂ = 1)− (1− spec)

sens− (1− spec)

Use this equation to correct P(D̂)

Gary King (Harvard) Content Analysis 14 / 33

Page 86: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Formalization from Epidemiology(Levy and Kass, 1970)

Accounting identity for 2 categories:

P(D̂ = 1) = (sens)P(D = 1) + (1− spec)P(D = 2)

Solve:

P(D = 1) =P(D̂ = 1)− (1− spec)

sens− (1− spec)

Use this equation to correct P(D̂)

Gary King (Harvard) Content Analysis 14 / 33

Page 87: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Formalization from Epidemiology(Levy and Kass, 1970)

Accounting identity for 2 categories:

P(D̂ = 1) = (sens)P(D = 1) + (1− spec)P(D = 2)

Solve:

P(D = 1) =P(D̂ = 1)− (1− spec)

sens− (1− spec)

Use this equation to correct P(D̂)

Gary King (Harvard) Content Analysis 14 / 33

Page 88: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Formalization from Epidemiology(Levy and Kass, 1970)

Accounting identity for 2 categories:

P(D̂ = 1) = (sens)P(D = 1) + (1− spec)P(D = 2)

Solve:

P(D = 1) =P(D̂ = 1)− (1− spec)

sens− (1− spec)

Use this equation to correct P(D̂)

Gary King (Harvard) Content Analysis 14 / 33

Page 89: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Generalizations: J Categories, No Individual Classification(King and Lu, 2007)

Accounting identity for J categories

P(D̂ = j) =J∑

j ′=1

P(D̂ = j |D = j ′)P(D = j ′)

Drop D̂ calculation, since D̂ = f (S):

P(S = s) =J∑

j ′=1

P(S = s|D = j ′)P(D = j ′)

Simplify to an equivalent matrix expression:

P(S) = P(S|D)P(D)

Gary King (Harvard) Content Analysis 15 / 33

Page 90: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Generalizations: J Categories, No Individual Classification(King and Lu, 2007)

Accounting identity for J categories

P(D̂ = j) =J∑

j ′=1

P(D̂ = j |D = j ′)P(D = j ′)

Drop D̂ calculation, since D̂ = f (S):

P(S = s) =J∑

j ′=1

P(S = s|D = j ′)P(D = j ′)

Simplify to an equivalent matrix expression:

P(S) = P(S|D)P(D)

Gary King (Harvard) Content Analysis 15 / 33

Page 91: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Generalizations: J Categories, No Individual Classification(King and Lu, 2007)

Accounting identity for J categories

P(D̂ = j) =J∑

j ′=1

P(D̂ = j |D = j ′)P(D = j ′)

Drop D̂ calculation, since D̂ = f (S):

P(S = s) =J∑

j ′=1

P(S = s|D = j ′)P(D = j ′)

Simplify to an equivalent matrix expression:

P(S) = P(S|D)P(D)

Gary King (Harvard) Content Analysis 15 / 33

Page 92: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Generalizations: J Categories, No Individual Classification(King and Lu, 2007)

Accounting identity for J categories

P(D̂ = j) =J∑

j ′=1

P(D̂ = j |D = j ′)P(D = j ′)

Drop D̂ calculation, since D̂ = f (S):

P(S = s) =J∑

j ′=1

P(S = s|D = j ′)P(D = j ′)

Simplify to an equivalent matrix expression:

P(S) = P(S|D)P(D)

Gary King (Harvard) Content Analysis 15 / 33

Page 93: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Uncertainty estimates by bootstrapping

Gary King (Harvard) Content Analysis 16 / 33

Page 94: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Document category proportions (quantity of interest)

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Uncertainty estimates by bootstrapping

Gary King (Harvard) Content Analysis 16 / 33

Page 95: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Word stem profile proportions (estimate in unlabeled set by tabulation)

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Uncertainty estimates by bootstrapping

Gary King (Harvard) Content Analysis 16 / 33

Page 96: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Word stem profiles, by category (estimate in labeled set by tabulation)

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Uncertainty estimates by bootstrapping

Gary King (Harvard) Content Analysis 16 / 33

Page 97: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ

=⇒ β = (X ′X )−1X ′y

Alternative symbols (to emphasize the linear equation)

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Uncertainty estimates by bootstrapping

Gary King (Harvard) Content Analysis 16 / 33

Page 98: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Solve for quantity of interest (with no error term)

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Uncertainty estimates by bootstrapping

Gary King (Harvard) Content Analysis 16 / 33

Page 99: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Uncertainty estimates by bootstrapping

Gary King (Harvard) Content Analysis 16 / 33

Page 100: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computer

P(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Uncertainty estimates by bootstrapping

Gary King (Harvard) Content Analysis 16 / 33

Page 101: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparse

Elements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Uncertainty estimates by bootstrapping

Gary King (Harvard) Content Analysis 16 / 33

Page 102: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Uncertainty estimates by bootstrapping

Gary King (Harvard) Content Analysis 16 / 33

Page 103: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Uncertainty estimates by bootstrapping

Gary King (Harvard) Content Analysis 16 / 33

Page 104: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average results

Equivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Uncertainty estimates by bootstrapping

Gary King (Harvard) Content Analysis 16 / 33

Page 105: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical data

Use constrained LS to constrain P(D) to simplex

Uncertainty estimates by bootstrapping

Gary King (Harvard) Content Analysis 16 / 33

Page 106: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Uncertainty estimates by bootstrapping

Gary King (Harvard) Content Analysis 16 / 33

Page 107: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Estimation

The matrix expression again:

P(S)2K×1

= P(S|D)2K×J

P(D)J×1

=⇒ Y = Xβ =⇒ β = (X ′X )−1X ′y

Technical estimation issues:

2K is enormous, far larger than any existing computerP(S) and P(S|D) will be too sparseElements of P(D) must be between 0 and 1 and sum to 1

Solutions

Use subsets of S; average resultsEquivalent to kernel density smoothing of sparse categorical dataUse constrained LS to constrain P(D) to simplex

Uncertainty estimates by bootstrapping

Gary King (Harvard) Content Analysis 16 / 33

Page 108: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

A Nonrandom Hand-coded Sample

0.0 0.1 0.2 0.3 0.4

0.0

0.1

0.2

0.3

0.4

Differences in Document Category Frequencies

Ph(D)

P(D

)

●●

●●

●●

●●

●●

● ●

● ●

●●

●●

0.01 0.02 0.05 0.10

0.01

0.02

0.05

0.10

Differences in Word Profile Frequencies

Ph(S)

P(S

)

All existing methods would fail with these data.

Gary King (Harvard) Content Analysis 17 / 33

Page 109: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Accurate Estimates

0.0 0.1 0.2 0.3 0.4

0.0

0.1

0.2

0.3

0.4

Actual P(D)

Est

imat

ed P

(D)

Gary King (Harvard) Content Analysis 18 / 33

Page 110: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Out of Sample Validation: Blogs

●●

0.0 0.1 0.2 0.3 0.4

0.0

0.1

0.2

0.3

0.4

Affect in Blogs

Actual P(D)

Est

imat

ed P

(D)

Gary King (Harvard) Content Analysis 19 / 33

Page 111: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Out of Sample Validation: Other Examples

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.1

0.2

0.3

0.4

0.5

Movie Reviews

Actual P(D)

Est

imat

ed P

(D)

●●

0.0 0.1 0.2 0.3 0.4 0.50.

00.

10.

20.

30.

40.

5

University Websites

Actual P(D)

Est

imat

ed P

(D)

Gary King (Harvard) Content Analysis 20 / 33

Page 112: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Bias by Number of Hand Coded Documents

Nonparametric Estimator

Number of Hand−Coded Documents

Bia

s

−.1

0−

.05

.00

.05

.10

0 200 400 600 800 1000

−2

Nonparametric Estimator

Number of Hand−Coded Documents

Bia

s

−.1

0−

.05

.00

.05

.10

0 200 400 600 800 1000

−1

Nonparametric Estimator

Number of Hand−Coded Documents

Bia

s

−.1

0−

.05

.00

.05

.10

0 200 400 600 800 1000

0

Nonparametric Estimator

Number of Hand−Coded Documents

Bia

s

−.1

0−

.05

.00

.05

.10

0 200 400 600 800 1000

1

Nonparametric Estimator

Number of Hand−Coded Documents

Bia

s

−.1

0−

.05

.00

.05

.10

0 200 400 600 800 1000

2

Nonparametric Estimator

Number of Hand−Coded Documents

Bia

s

−.1

0−

.05

.00

.05

.10

0 200 400 600 800 1000

NB

Nonparametric Estimator

Number of Hand−Coded Documents

Bia

s

−.1

0−

.05

.00

.05

.10

0 200 400 600 800 1000

NO

Sampling Estimator

Number of Hand−Coded Documents

Bia

s

−.1

0−

.05

.00

.05

.10

0 200 400 600 800 1000

−2

Sampling Estimator

Number of Hand−Coded Documents

Bia

s

−.1

0−

.05

.00

.05

.10

0 200 400 600 800 1000

−1

Sampling Estimator

Number of Hand−Coded Documents

Bia

s

−.1

0−

.05

.00

.05

.10

0 200 400 600 800 1000

0

Sampling Estimator

Number of Hand−Coded Documents

Bia

s

−.1

0−

.05

.00

.05

.10

0 200 400 600 800 1000

1

Sampling Estimator

Number of Hand−Coded Documents

Bia

s

−.1

0−

.05

.00

.05

.10

0 200 400 600 800 1000

2

Sampling Estimator

Number of Hand−Coded Documents

Bia

s

−.1

0−

.05

.00

.05

.10

0 200 400 600 800 1000

NB

Sampling Estimator

Number of Hand−Coded Documents

Bia

s

−.1

0−

.05

.00

.05

.10

0 200 400 600 800 1000

NO

Gary King (Harvard) Content Analysis 21 / 33

Page 113: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Average RMSE by Number of Hand Coded Documents

200 400 600 800 1000

0.00

0.01

0.02

0.03

0.04

Number of Hand−Coded Documents200 400 600 800 1000

0.00

0.01

0.02

0.03

0.04

Number of Hand−Coded Documents

Avg

. Roo

t Mea

n S

quar

ed E

rror

Gary King (Harvard) Content Analysis 22 / 33

Page 114: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Misclassification Matrix for Blog Posts

-2 -1 0 1 2 NA NB P(D1)

-2 .70 .10 .01 .01 .00 .02 .16 .28-1 .33 .25 .04 .02 .01 .01 .35 .080 .13 .17 .13 .11 .05 .02 .40 .021 .07 .06 .08 .20 .25 .01 .34 .032 .03 .03 .03 .22 .43 .01 .25 .03

NA .04 .01 .00 .00 .00 .81 .14 .12NB .10 .07 .02 .02 .02 .04 .75 .45

Gary King (Harvard) Content Analysis 23 / 33

Page 115: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

SIMEX Analysis of “Not a Blog” Category

0.0

0.2

0.4

0.6

0.8

1.0

Category NB

αα−1 0

0.0

0.2

0.4

0.6

0.8

1.0

αα

Gary King (Harvard) Content Analysis 24 / 33

Page 116: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

SIMEX Analysis of “Not a Blog” Category

0.0

0.2

0.4

0.6

0.8

1.0

Category NB

αα−1 0

0.0

0.2

0.4

0.6

0.8

1.0

αα

0.0

0.2

0.4

0.6

0.8

1.0

αα

Gary King (Harvard) Content Analysis 25 / 33

Page 117: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

SIMEX Analysis of “Not a Blog” Category

−1 0 1 2 3

0.0

0.2

0.4

0.6

0.8

1.0

Category NB

αα−1 0 1 2 3

0.0

0.2

0.4

0.6

0.8

1.0

Category NB

αα

0.0

0.2

0.4

0.6

0.8

1.0

αα

0.0

0.2

0.4

0.6

0.8

1.0

αα

Gary King (Harvard) Content Analysis 26 / 33

Page 118: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

SIMEX Analysis of Other Categories

−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category −2

αα−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category −2

αα

−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category −1

αα−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category −1

αα

−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category 0

αα−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category 0

αα

−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category 1

αα−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category 1

αα

−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category 2

αα−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category 2

αα

−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category NA

αα−1 0 1 2 3

0.0

0.1

0.2

0.3

0.4

0.5

Category NA

αα

Gary King (Harvard) Content Analysis 27 / 33

Page 119: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

What can go wrong?

We assume Ph(S|D) = P(S|D)

Must choose word stem subset size (a smoothing parameter)

Need enough labeled documents in each category (can hand codemore if CI’s are too large, perhaps via case-control methods)

Need sufficient information in: documents, categorization scheme,numerical summaries of the documents, and hand-codings

Use additional hand coding to verify assumptions

Gary King (Harvard) Content Analysis 28 / 33

Page 120: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

What can go wrong?

We assume Ph(S|D) = P(S|D)

Must choose word stem subset size (a smoothing parameter)

Need enough labeled documents in each category (can hand codemore if CI’s are too large, perhaps via case-control methods)

Need sufficient information in: documents, categorization scheme,numerical summaries of the documents, and hand-codings

Use additional hand coding to verify assumptions

Gary King (Harvard) Content Analysis 28 / 33

Page 121: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

What can go wrong?

We assume Ph(S|D) = P(S|D)

Must choose word stem subset size (a smoothing parameter)

Need enough labeled documents in each category (can hand codemore if CI’s are too large, perhaps via case-control methods)

Need sufficient information in: documents, categorization scheme,numerical summaries of the documents, and hand-codings

Use additional hand coding to verify assumptions

Gary King (Harvard) Content Analysis 28 / 33

Page 122: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

What can go wrong?

We assume Ph(S|D) = P(S|D)

Must choose word stem subset size (a smoothing parameter)

Need enough labeled documents in each category (can hand codemore if CI’s are too large, perhaps via case-control methods)

Need sufficient information in: documents, categorization scheme,numerical summaries of the documents, and hand-codings

Use additional hand coding to verify assumptions

Gary King (Harvard) Content Analysis 28 / 33

Page 123: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

What can go wrong?

We assume Ph(S|D) = P(S|D)

Must choose word stem subset size (a smoothing parameter)

Need enough labeled documents in each category (can hand codemore if CI’s are too large, perhaps via case-control methods)

Need sufficient information in: documents, categorization scheme,numerical summaries of the documents, and hand-codings

Use additional hand coding to verify assumptions

Gary King (Harvard) Content Analysis 28 / 33

Page 124: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

What can go wrong?

We assume Ph(S|D) = P(S|D)

Must choose word stem subset size (a smoothing parameter)

Need enough labeled documents in each category (can hand codemore if CI’s are too large, perhaps via case-control methods)

Need sufficient information in: documents, categorization scheme,numerical summaries of the documents, and hand-codings

Use additional hand coding to verify assumptions

Gary King (Harvard) Content Analysis 28 / 33

Page 125: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Verbal Autopsy Methods

The Problem

Policymakers need the cause-specific mortality rate to set researchgoals, budgetary priorities, and ameliorative policiesHigh quality death registration: only 23/192 countries

Existing Approaches

Ask relatives or caregivers 50-100 symptom questionsAsk physicians to determine cause of death (low intercoder reliability)Apply expert algorithms (high reliability, low validity)Find deaths with medically certified causes from a local hospital, tracecaregivers to their homes, ask the same symptom questions, andstatistically classify deaths in population (model-dependent, lowaccuracy)

Gary King (Harvard) Content Analysis 29 / 33

Page 126: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Verbal Autopsy Methods

The Problem

Policymakers need the cause-specific mortality rate to set researchgoals, budgetary priorities, and ameliorative policiesHigh quality death registration: only 23/192 countries

Existing Approaches

Ask relatives or caregivers 50-100 symptom questionsAsk physicians to determine cause of death (low intercoder reliability)Apply expert algorithms (high reliability, low validity)Find deaths with medically certified causes from a local hospital, tracecaregivers to their homes, ask the same symptom questions, andstatistically classify deaths in population (model-dependent, lowaccuracy)

Gary King (Harvard) Content Analysis 29 / 33

Page 127: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Verbal Autopsy Methods

The Problem

Policymakers need the cause-specific mortality rate to set researchgoals, budgetary priorities, and ameliorative policies

High quality death registration: only 23/192 countries

Existing Approaches

Ask relatives or caregivers 50-100 symptom questionsAsk physicians to determine cause of death (low intercoder reliability)Apply expert algorithms (high reliability, low validity)Find deaths with medically certified causes from a local hospital, tracecaregivers to their homes, ask the same symptom questions, andstatistically classify deaths in population (model-dependent, lowaccuracy)

Gary King (Harvard) Content Analysis 29 / 33

Page 128: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Verbal Autopsy Methods

The Problem

Policymakers need the cause-specific mortality rate to set researchgoals, budgetary priorities, and ameliorative policiesHigh quality death registration: only 23/192 countries

Existing Approaches

Ask relatives or caregivers 50-100 symptom questionsAsk physicians to determine cause of death (low intercoder reliability)Apply expert algorithms (high reliability, low validity)Find deaths with medically certified causes from a local hospital, tracecaregivers to their homes, ask the same symptom questions, andstatistically classify deaths in population (model-dependent, lowaccuracy)

Gary King (Harvard) Content Analysis 29 / 33

Page 129: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Verbal Autopsy Methods

The Problem

Policymakers need the cause-specific mortality rate to set researchgoals, budgetary priorities, and ameliorative policiesHigh quality death registration: only 23/192 countries

Existing Approaches

Ask relatives or caregivers 50-100 symptom questionsAsk physicians to determine cause of death (low intercoder reliability)Apply expert algorithms (high reliability, low validity)Find deaths with medically certified causes from a local hospital, tracecaregivers to their homes, ask the same symptom questions, andstatistically classify deaths in population (model-dependent, lowaccuracy)

Gary King (Harvard) Content Analysis 29 / 33

Page 130: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Verbal Autopsy Methods

The Problem

Policymakers need the cause-specific mortality rate to set researchgoals, budgetary priorities, and ameliorative policiesHigh quality death registration: only 23/192 countries

Existing Approaches

Ask relatives or caregivers 50-100 symptom questions

Ask physicians to determine cause of death (low intercoder reliability)Apply expert algorithms (high reliability, low validity)Find deaths with medically certified causes from a local hospital, tracecaregivers to their homes, ask the same symptom questions, andstatistically classify deaths in population (model-dependent, lowaccuracy)

Gary King (Harvard) Content Analysis 29 / 33

Page 131: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Verbal Autopsy Methods

The Problem

Policymakers need the cause-specific mortality rate to set researchgoals, budgetary priorities, and ameliorative policiesHigh quality death registration: only 23/192 countries

Existing Approaches

Ask relatives or caregivers 50-100 symptom questionsAsk physicians to determine cause of death (low intercoder reliability)

Apply expert algorithms (high reliability, low validity)Find deaths with medically certified causes from a local hospital, tracecaregivers to their homes, ask the same symptom questions, andstatistically classify deaths in population (model-dependent, lowaccuracy)

Gary King (Harvard) Content Analysis 29 / 33

Page 132: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Verbal Autopsy Methods

The Problem

Policymakers need the cause-specific mortality rate to set researchgoals, budgetary priorities, and ameliorative policiesHigh quality death registration: only 23/192 countries

Existing Approaches

Ask relatives or caregivers 50-100 symptom questionsAsk physicians to determine cause of death (low intercoder reliability)Apply expert algorithms (high reliability, low validity)

Find deaths with medically certified causes from a local hospital, tracecaregivers to their homes, ask the same symptom questions, andstatistically classify deaths in population (model-dependent, lowaccuracy)

Gary King (Harvard) Content Analysis 29 / 33

Page 133: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Verbal Autopsy Methods

The Problem

Policymakers need the cause-specific mortality rate to set researchgoals, budgetary priorities, and ameliorative policiesHigh quality death registration: only 23/192 countries

Existing Approaches

Ask relatives or caregivers 50-100 symptom questionsAsk physicians to determine cause of death (low intercoder reliability)Apply expert algorithms (high reliability, low validity)Find deaths with medically certified causes from a local hospital, tracecaregivers to their homes, ask the same symptom questions, andstatistically classify deaths in population (model-dependent, lowaccuracy)

Gary King (Harvard) Content Analysis 29 / 33

Page 134: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

An Alternative Approach

Document Category, Cause of Death,

Di =

1 if bladder cancer

2 if cardiovascular disease

3 if transportation accident...

...

J if infectious respiratory

Word Stem Profile, Symptoms:

Si =

Si1 = 1 if “breathing difficulties”, 0 if not

Si2 = 1 if “stomach ache”, 0 if not...

...

SiK = 1 if “diarrhea”, 0 if not

Apply the same methods

Gary King (Harvard) Content Analysis 30 / 33

Page 135: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

An Alternative Approach

Document Category, Cause of Death,

Di =

1 if bladder cancer

2 if cardiovascular disease

3 if transportation accident...

...

J if infectious respiratory

Word Stem Profile, Symptoms:

Si =

Si1 = 1 if “breathing difficulties”, 0 if not

Si2 = 1 if “stomach ache”, 0 if not...

...

SiK = 1 if “diarrhea”, 0 if not

Apply the same methods

Gary King (Harvard) Content Analysis 30 / 33

Page 136: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

An Alternative Approach

Document Category, Cause of Death,

Di =

1 if bladder cancer

2 if cardiovascular disease

3 if transportation accident...

...

J if infectious respiratory

Word Stem Profile, Symptoms:

Si =

Si1 = 1 if “breathing difficulties”, 0 if not

Si2 = 1 if “stomach ache”, 0 if not...

...

SiK = 1 if “diarrhea”, 0 if not

Apply the same methods

Gary King (Harvard) Content Analysis 30 / 33

Page 137: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

An Alternative Approach

Document Category, Cause of Death,

Di =

1 if bladder cancer

2 if cardiovascular disease

3 if transportation accident...

...

J if infectious respiratory

Word Stem Profile, Symptoms:

Si =

Si1 = 1 if “breathing difficulties”, 0 if not

Si2 = 1 if “stomach ache”, 0 if not...

...

SiK = 1 if “diarrhea”, 0 if not

Apply the same methods

Gary King (Harvard) Content Analysis 30 / 33

Page 138: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Validation in China

● ●

●●

●●

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.00

0.05

0.10

0.15

0.20

0.25

0.30

Random Split Sample

True

Est

imat

e

● ●

●●

●●

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.00

0.05

0.10

0.15

0.20

0.25

0.30

Bigger Cities

True

Est

imat

e

● ●

● ●

●●

●●

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.00

0.05

0.10

0.15

0.20

0.25

0.30

Smaller Cities

True

Est

imat

e

Gary King (Harvard) Content Analysis 31 / 33

Page 139: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

Validation in Tanzania

●●

● ●

●●

●●

●●●●●

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.00

0.05

0.10

0.15

0.20

0.25

0.30

Adult

True

Est

imat

e

●●●

●●●

0.00 0.05 0.10 0.15 0.20 0.25 0.30

0.00

0.05

0.10

0.15

0.20

0.25

0.30

Child

True

Est

imat

e

Gary King (Harvard) Content Analysis 32 / 33

Page 140: How to Read 100 Million Blogs (& Classify Deaths Without ... · Prominent early social scientists used it: Berelson, de Grazia, etc. Spread to vast array of fields (use increased

For more information

http://GKing.Harvard.edu

Gary King (Harvard) Content Analysis 33 / 33