big$dataversusthe$crowd · *mike#...
TRANSCRIPT
Big Data Versus the Crowd Looking for Rela8onships in All the Right Places
Ce Zhang, Feng Niu, Christopher Ré, and Jude Shavlik
Execu8ve Summary Big Data vs. the Crowd
Methodology: Follow State-‐of-‐the-‐art DS Scheme
Feature Extrac8on Distant Supervision
To Combat Inaccurate Labels in DS
How does increasing the corpus size impact the quality of DS?
How does increasing the amount of human feedback impact the quality of DS?
Hypothesis: larger the corpus, bePer the quality
Men8on Obama Barack Obama B. Obama
Men8on1 Men8on2 Feature B. Obama Michelle
Men8on En8ty Obama Barack Obama B. Obama
Men8on extrac8on
Linguis8c paPern extrac8on
En8ty Linking
• Person, LocaDon, OrganizaDon ….
• Use Stanford CoreNLP
• Link menDons to Freebase using exact string matching
• Dependency path between menDons.
• Word sequence between menDons.
PERSON
PERSON
marry nsubj
dobj
Distant Supervision (DS) Given Freebase as a knowledge base
find menDon pairs supporDng Freebase.
En8ty En8ty
Human Feedback (HF) Annotate menDon pairs generated by distant supervision with Y/N: -‐ 3 turkers for each pair -‐ with quality control
Hypothesis: more human feedback, bePer the quality
Distant Supervision HasSpouse
Lilly LedbeZer joined Obama and his wife, Michelle,…
Senator Barack Obama and Michelle Obama …
En8ty En8ty
Use broad coverage and redundancy in the large corpus.
Ask humans in the crowd to provide feedbacks.
* Mike Mintz et al.. Distant supervision for relaDon extracDon without labeled data. In ACL 2009.
Takeaways When developing a DS system, one should first expand the training Corpus in order to improve recall and then worry about the precision of training examples.
Mo8va8ng Applica8on: DeepDive DeepDive approaches relaDon extracDon using human analysts and staDsDcal signals from terabytes of data. It applies distant supervision to generate training examples from unlabeled text corpus.
DeepDive
hZp://research.cs.wisc.edu/hazy/deepdive/
Barack Obama brought his wife Michelle to Kenya three years later… Text
Subject Spouse Barack Obama Michelle Obama George W. Bush Laura Welch Bill Clinton Hillary Rodham George H. W. Bush Barbara Pierce
HasSpouse
Facts
Contribu8on Empirically study the factors that contribute to distant supervision quality and their relaDve impact.
ClueWeb Future Work • Study more sophis8cated models (than LR)
and distant supervision scheme. • Explore other (more effec8ve) usage of
human feedbacks.
Experiments
Experiment Setup
Machine Learning Model Train a logisDc regression model using training data obtained from DS and HF.
To Study our Hypotheses… • Subsample the document corpus to get
different DS training set. • Subsample the human annotaDons to get
different HF training set.
Data sets (another data set reported in paper) TAC-‐KBP: 1.8M news arDcles ClueWeb: 500M Web pages
1000
10000
100000
1000000
1800000
Corpus Size
Human Feedback Amount
F1
Impact of Big Data
The larger the corpus size is, the higher the quality we can expect.
F1 is a log-‐linear func8on in the corpus size that we used for distant supervision.
Large corpus increases the coverage of linguis8c paPerns in DS.
0 0.1 0.2 0.3 0.4 0.5
Precision
0
0.1
0.2
Recall
Corpus Size
TAC-‐KBP
Human Feedback Amount
Impact of the Crowd
Human feedback has comparable precision as DS in our current protocol.
0 0.1 0.2 0.3 0.4 0.5
Precision
0
0.1
0.2
Recall
But the slope is smaller than increasing the corpus size.
F1 increases sta8s8cally significantly when we provide more human feedbacks.
0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35
1.E+2 1.E+3 1.E+4 1.E+5 1.E+6 1.E+7 1.E+8
F1, R
ecall, or
Precision
Corpus size
F1 Recall Precision
Without human feedback With full human feedback 1K-‐doc corpus 1.8M-‐doc corpus
ClueWeb This work is generously supported by the Air Force Research Laboratory (AFRL) under prime contract no. FA8750-‐09-‐C-‐0181. Any opinions, findings, and conclusion or recommendaDons expressed in this work are those of the authors and do not necessarily reflect the views of any of the above sponsors including DARPA, AFRL or the US government.
Follow state-‐of-‐the-‐art approaches in each step, but study them in a new level of scalability (up to 100 million documents).
Department of Computer Science, University of Wisconsin-‐Madison, USA
Distant Supervision(DS)* automaDcally generates labeled training data from a text corpus. However, some labels are not accurate.
0
0.20
0.10
0.15
0.05