big$dataversusthe$crowd · *mike#...

Big Data Versus the Crowd Looking for Rela8onships in All the Right Places

Ce Zhang, Feng Niu, Christopher Ré, and Jude Shavlik

Execu8ve Summary Big Data vs. the Crowd

Methodology: Follow State-‐of-‐the-‐art DS Scheme

Feature Extrac8on Distant Supervision

To Combat Inaccurate Labels in DS

How does increasing the corpus size impact the quality of DS?

How does increasing the amount of human feedback impact the quality of DS?

Hypothesis: larger the corpus, bePer the quality

Men8on Obama Barack Obama B. Obama

Men8on1 Men8on2 Feature B. Obama Michelle

Men8on En8ty Obama Barack Obama B. Obama

Men8on extrac8on

Linguis8c paPern extrac8on

En8ty Linking

•  Person, LocaDon, OrganizaDon ….

•  Use Stanford CoreNLP

•  Link menDons to Freebase using exact string matching

•  Dependency path between menDons.

•  Word sequence between menDons.

PERSON

PERSON

marry nsubj

dobj

Distant Supervision (DS) Given Freebase as a knowledge base

find menDon pairs supporDng Freebase.

En8ty En8ty

Human Feedback (HF) Annotate menDon pairs generated by distant supervision with Y/N: -‐ 3 turkers for each pair -‐ with quality control

Hypothesis: more human feedback, bePer the quality

Distant Supervision HasSpouse

Lilly LedbeZer joined Obama and his wife, Michelle,…

Senator Barack Obama and Michelle Obama …

En8ty En8ty

Use broad coverage and redundancy in the large corpus.

Ask humans in the crowd to provide feedbacks.

* Mike Mintz et al.. Distant supervision for relaDon extracDon without labeled data. In ACL 2009.

Takeaways When developing a DS system, one should first expand the training Corpus in order to improve recall and then worry about the precision of training examples.

Mo8va8ng Applica8on: DeepDive DeepDive approaches relaDon extracDon using human analysts and staDsDcal signals from terabytes of data. It applies distant supervision to generate training examples from unlabeled text corpus.

DeepDive

hZp://research.cs.wisc.edu/hazy/deepdive/

Barack Obama brought his wife Michelle to Kenya three years later… Text

Subject Spouse Barack Obama Michelle Obama George W. Bush Laura Welch Bill Clinton Hillary Rodham George H. W. Bush Barbara Pierce

HasSpouse

Facts

Contribu8on Empirically study the factors that contribute to distant supervision quality and their relaDve impact.

ClueWeb Future Work •  Study more sophis8cated models (than LR)

and distant supervision scheme. •  Explore other (more effec8ve) usage of

human feedbacks.

Experiments

Experiment Setup

Machine Learning Model Train a logisDc regression model using training data obtained from DS and HF.

To Study our Hypotheses… •  Subsample the document corpus to get

different DS training set. •  Subsample the human annotaDons to get

different HF training set.

Data sets (another data set reported in paper) TAC-‐KBP: 1.8M news arDcles ClueWeb: 500M Web pages

1000

10000

100000

1000000

1800000

Corpus Size

Human Feedback Amount

F1

Impact of Big Data

The larger the corpus size is, the higher the quality we can expect.

F1 is a log-‐linear func8on in the corpus size that we used for distant supervision.

Large corpus increases the coverage of linguis8c paPerns in DS.

0 0.1 0.2 0.3 0.4 0.5

Precision

0

0.1

0.2

Recall

Corpus Size

TAC-‐KBP

Human Feedback Amount

Impact of the Crowd

Human feedback has comparable precision as DS in our current protocol.

0 0.1 0.2 0.3 0.4 0.5

Precision

0

0.1

0.2

Recall

But the slope is smaller than increasing the corpus size.

F1 increases sta8s8cally significantly when we provide more human feedbacks.

0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35

1.E+2 1.E+3 1.E+4 1.E+5 1.E+6 1.E+7 1.E+8

F1, R

ecall, or

Precision

Corpus size

F1 Recall Precision

Without human feedback With full human feedback 1K-‐doc corpus 1.8M-‐doc corpus

ClueWeb This work is generously supported by the Air Force Research Laboratory (AFRL) under prime contract no. FA8750-‐09-‐C-‐0181. Any opinions, findings, and conclusion or recommendaDons expressed in this work are those of the authors and do not necessarily reflect the views of any of the above sponsors including DARPA, AFRL or the US government.

Follow state-‐of-‐the-‐art approaches in each step, but study them in a new level of scalability (up to 100 million documents).

Department of Computer Science, University of Wisconsin-‐Madison, USA

Distant Supervision(DS)* automaDcally generates labeled training data from a text corpus. However, some labels are not accurate.

0

0.20

0.10

0.15

0.05

big$dataversusthe$crowd · *mike#...

Documents