Deriving a Web-Scale Commonsense Fact Database
Niket Tandon Gerard de Melo Gerhard Weikum Max Planck Institute for Informatics
Aug 11, 2011
SOME TRIVIAL FACTS
Apples are green, red, juicy, sweet, but not fast or funny
Parks and meadows are green or lively, but not black or slow
Keys are kept in a pocket, but not in the air
Question: How do computers know this? Solution: Build a commonsense knowledge base
INTRODUCTION
What is the problem? Harvest commonsense facts from text:
– Flower is soft => hasProperty(flower, soft)
– Room is part of house => partOf(room, house)
Why is it hard? Commonsense facts are rarely mentioned explicitly in text, and natural language text is noisy.
What is required to tackle the problem? A Web-scale corpus. But a Web-scale corpus is hard to get! Using Web-scale N-grams instead poses interesting research challenges.
MESSAGE OF THE TALK
N-grams simulate a larger corpus
Existing information extraction models must be carefully adapted for harvesting facts
AGENDA
1. Introduction
2. Pattern-based information extraction model
3. Web N-grams
4. Pattern ranking
5. Extraction and ranking of facts
GOOD SEEDS => (GOOD) PATTERNS
Seed pairs (fire, hot), (ice, cold), and (flower, beautiful) all instantiate the pattern "X is very Y"
Text: "He bought very sweet apples" / "Apples and sweet potato are delicious" / "He kept the keys in pocket"
Seed facts: hasProperty: <apple, sweet>; hasLocation: <key, pocket>
Candidate patterns:
[hasProperty] "He bought very Y X", "X and Y potato are delicious"
[hasLocation] "He kept the X in Y"
Note that "X and Y potato are delicious" arises from a coincidental match: not every candidate pattern is a good one.
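The seed-to-pattern step can be sketched as follows (a minimal illustration with hypothetical helper names, not the authors' exact implementation): given a seed pair and an n-gram containing both seed words, replace those words with X/Y placeholders.

```python
def induce_pattern(ngram, seed):
    """Turn an n-gram into a candidate pattern by replacing the seed's
    argument words with X and Y placeholders."""
    x, y = seed
    tokens = ngram.split()
    if x not in tokens or y not in tokens:
        return None  # the n-gram does not mention both seed arguments
    return " ".join("X" if t == x else "Y" if t == y else t for t in tokens)

# Seed hasProperty: <apple, sweet> against a matching 5-gram:
print(induce_pattern("He bought very sweet apples", ("apples", "sweet")))
# -> He bought very Y X
```

The same call on "He kept the keys in pocket" with seed <keys, pocket> yields "He kept the X in Y", the hasLocation candidate from the slide.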
GOOD PATTERNS => (GOOD) TUPLES
N-gram: "He kept the butter in refrigerator"
Pattern: [hasLocation] "He kept the X in Y"
Extracted tuple: [hasLocation] <butter, refrigerator>
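The reverse step, applying a pattern to new n-grams to extract candidate tuples, can be sketched like this (function name is mine; single-word arguments assumed for simplicity):

```python
import re

def apply_pattern(pattern, ngram):
    """Match an X/Y pattern against an n-gram and return the extracted
    (X, Y) argument pair, or None if the pattern does not match."""
    parts = [r"(?P<x>\w+)" if tok == "X"
             else r"(?P<y>\w+)" if tok == "Y"
             else re.escape(tok)
             for tok in pattern.split()]
    m = re.fullmatch(r"\s+".join(parts), ngram)
    return (m.group("x"), m.group("y")) if m else None

print(apply_pattern("He kept the X in Y", "He kept the butter in refrigerator"))
# -> ('butter', 'refrigerator')
```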
MODEL
Seeds => Pattern Induction => Pattern Ranking => Fact Extraction and Ranking
Induced patterns: [hasProperty] "He bought very Y X", "X and Y potato are delicious"; [hasLocation] "He kept the X in Y"
STATE OF THE ART - PATTERN-BASED IE
DIPRE - Brin '98, Snowball - Agichtein et al. '00, KnowItAll - Etzioni et al. '04
Observations: low recall on easily available corpora (a large corpus is difficult to get); low precision when applied to our corpus
The corpus we use to extract facts is Web-scale N-grams
WEB-SCALE N-GRAMS
N-gram: a sequence of N consecutive word tokens, e.g. "the apples are very red"
Web-scale N-gram statistics are derived from a trillion words of text, with a frequency for each n-gram, e.g. "the apples are very red" 12000
Datasets: Google N-grams, Microsoft N-grams, Yahoo N-grams
N-gram dataset limitation: length <= 5, so longer contexts cannot be captured, e.g. "the apple that he was eating was very red"
But most commonsense relations fit this small context, and the sheer volume of data is an advantage
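Such datasets are typically distributed as tab-separated "n-gram, frequency" records; a minimal reader might look like this (the record format is an assumption, and the second sample count is invented for illustration):

```python
def read_ngrams(lines):
    """Parse 'token sequence<TAB>frequency' records, as distributed in
    Web-scale n-gram datasets (format assumed here)."""
    for line in lines:
        ngram, _, freq = line.rstrip("\n").rpartition("\t")
        yield ngram, int(freq)

sample = ["the apples are very red\t12000", "kept the keys in pocket\t530"]
for ngram, freq in read_ngrams(sample):
    print(ngram, freq)
```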
EXAMPLE OF COMMONSENSE RELATIONS
OUR APPROACH
Use ConceptNet data as seeds to harvest commonsense facts from the Google N-grams corpus
ConceptNet: MIT's commonsense knowledge base, constructed by crowd-sourcing and further processed
We take a very large number of seeds, which avoids drift across iterations
We consider variations of seeds for nouns (plural forms): [key, pocket], [keys, pocket], [keys, pockets]
This gives a very large number of potential patterns, but most are noise
We constrain patterns by part-of-speech tags: X<noun> is very Y<adjective>
We need to carefully rank the potential patterns
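The plural-variant expansion of seeds can be sketched as below (naive "+s" pluralization only; the actual system presumably handles morphology more carefully):

```python
def seed_variations(x, y):
    """Expand a seed pair with naive plural forms of its noun arguments."""
    return sorted({(a, b) for a in (x, x + "s") for b in (y, y + "s")})

print(seed_variations("key", "pocket"))
# -> [('key', 'pocket'), ('key', 'pockets'), ('keys', 'pocket'), ('keys', 'pockets')]
```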
One dirty fish spoils the whole pond!
EXISTING PATTERN RANKING APPROACH: PMI
PMI score for a pattern p with matching seeds (x, y)
Problems with PMI here:
- Uses raw frequencies, not distinct seeds: biased towards rare events (strings containing the seed words by chance)
- Frequencies alone are not enough (spam, boilerplate text)
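The PMI formula itself did not survive transcription. In Espresso-style pattern ranking, which this slide appears to follow, the score of a pattern $p$ over its matching seed pairs $(x, y)$ is commonly defined as

$$\mathrm{pmi}(p) = \sum_{(x,y)} \log \frac{|x, p, y|}{|x, *, y|\,|*, p, *|}$$

where $|x, p, y|$ counts occurrences of the pattern instantiated with the seed pair and the starred terms are the marginal counts; the exact variant used here may differ.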
PATTERN RANKING – OBSERVATION 1
Observation 1: The number of seeds a pattern matches follows a power law; unreliable patterns likely lie in the tail.
Question 1: Can we find the patterns that are not in the tail?
The score based on Observation 1 employs the gradient of the power-law curve s(x) ~ a·x^k (rather than a fixed threshold).
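One way to read the gradient-based cutoff (an illustrative sketch, not the paper's exact procedure): walk down the descending seed-match counts and stop at the rank where the curve flattens, i.e. where the gradient drops below a threshold.

```python
def tail_cutoff(counts, min_gradient=2):
    """Given per-pattern seed-match counts sorted in descending order,
    return the rank at which the power-law curve flattens; patterns past
    this rank fall into the unreliable tail."""
    for i in range(1, len(counts)):
        if counts[i - 1] - counts[i] < min_gradient:
            return i
    return len(counts)

counts = [900, 400, 150, 60, 20, 5, 4, 4, 3, 3]
print(tail_cutoff(counts))  # -> 6 : only the first six patterns are kept
```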
PATTERN RANKING – OBSERVATION 2
Observation 2: Some patterns match many seeds but match all sorts of things, e.g. "<X> and <Y>" matches seeds of many different relations; PMI does not consider the number of relations matched.
Question 2: Can we penalize such patterns?
The score based on Observation 2 penalizes a pattern by the number of distinct relations it matches.
PATTERN RANKING – OUR APPROACH
Combined pattern score: the scores based on Observations 1 and 2 are combined using a logistic function
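The logistic combination can be sketched as follows (the weights and bias are illustrative placeholders, not the values used in the paper):

```python
import math

def combined_score(s1, s2, w1=1.0, w2=1.0, bias=0.0):
    """Squash a weighted sum of the two observation-based scores
    into (0, 1) with the logistic function."""
    return 1.0 / (1.0 + math.exp(-(w1 * s1 + w2 * s2 + bias)))

print(round(combined_score(2.0, -1.0), 3))  # -> 0.731
```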
IMPROVEMENT OVER PMI IN PATTERN RANKING (ISA RELATION)
Top-Ranked Patterns (PMI)   Top-Ranked Patterns (q)
Varsity <Y> <X>             Men <Y>/<X>
<Y> MLB <X>                 <Y> : <X> </S>
<Y> <X> Boys                <Y> <X> </S>
<Y> Posters <X>             Basketball <Y> - <X> </S>
<Y> - <X>                   Basketball <Y> such as <X>
<Y> MLB <X> NBA             <S> <X> <Y>
<Y> Badminton <X>           <X> and other <Y>
Example: "San Francisco and other cities" instantiates "<X> and other <Y>"
ESTIMATE FACT CONFIDENCE: SIMPLE APPROACH
Each tuple is represented by [pattern id : frequency] entries:
recipes yummy [16:130, 19:51, 21:55, 98:219, 10:80, 63:180, 29:51, 121:57] - matches several patterns
title unique [3:111, 2:63, 114:91, 1:213, 0:788, 41:246, 55:95, 22:112, 18:75, 9:48, 60:64, 14:71] - matches several patterns
apples nutritious [12:144] - matches few patterns
applet unable [11:62] - matches few patterns
Good tuples match many patterns; counting matched patterns alone gives low recall but high precision
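The pattern-id:frequency vectors above can be assembled from raw matches like this (a sketch; variable names are mine):

```python
from collections import defaultdict

def pattern_count_vectors(matches):
    """Aggregate (tuple, pattern_id, frequency) matches into the per-tuple
    {pattern_id: frequency} feature maps shown above."""
    vectors = defaultdict(dict)
    for tup, pattern_id, freq in matches:
        vectors[tup][pattern_id] = vectors[tup].get(pattern_id, 0) + freq
    return dict(vectors)

matches = [(("recipes", "yummy"), 16, 130), (("recipes", "yummy"), 19, 51),
           (("apples", "nutritious"), 12, 144)]
vecs = pattern_count_vectors(matches)
print(len(vecs[("recipes", "yummy")]))  # matched by 2 distinct patterns
```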
OUR FACT RANKING APPROACH
The pattern-count feature vectors are used to learn a decision tree, which yields facts with estimated confidence.
          P1        P2        P3        ...  label
Tuple 1   f(T1,P1)  f(T1,P2)  f(T1,P3)  ...  Positive
Tuple 2   f(T2,P1)  f(T2,P2)  f(T2,P3)  ...  Negative
Tuple 3   f(T3,P1)  f(T3,P2)  f(T3,P3)  ...  Positive
RECAP: MODEL
Construct seeds from ConceptNet
Pattern induction over Google 5-grams
Pattern ranking (match many seeds, but not too many)
Fact extraction with clean patterns over Google 5-grams
EXPERIMENTAL SETUP
Test data (true and false labels): randomly chosen high-confidence facts from ConceptNet
Precision and recall computed using 10-fold cross-validation over the test data
Classifier used: decision trees with adaptive boosting
RESULTS – MORE THAN 200 MILLION FACTS EXTRACTED
Relation         Precision (%)   #Facts Extracted   Relative Recall (%)
CapableOf        77                 907,173         45
Causes           88               3,218,388         49
Desires          58               4,386,685         69
HasPrerequisite  82               5,336,630         65
HasProperty      62               2,976,028         48
IsA              62              11,694,235         27
LocatedNear      71              13,930,656         61
PartOf           71              11,175,349         58
SimilarSize      74               8,640,737         49
.. many others   ...             ...                ...
An extension of ConceptNet by orders of magnitude
FURTHER DIRECTIONS
Tune the system towards higher precision to release a high-quality knowledge base
Applications enabled by a commonsense knowledge base
TAKE HOME MESSAGE
N-grams simulate a larger corpus; N-grams embed patterns and frequency
Novel pattern ranking adapted for the N-gram corpus; PMI is not the best choice in our case
The extracted fact matrix extends ConceptNet by more than 200x!
ADDITIONAL SLIDES FOLLOW
INACCURACIES IN CONCEPTNET
• Properties are wrongly compounded
– HasProperty(apple, green yellow red)[usually] 1
• Scores of zero are given to correct tuples
– HasProperty(wallet, black)[] 0
• Negated entries are in fact commonsense facts
– HasProperty(jeans, blue)[not] 1
• Polarity markers are confusing for machine consumption
– HasProperty(jeans, blue)[not] 1
– HasProperty(jeans, blue)[often] 1
– HasProperty(jeans, blue)[usually] 1
• Some entries are wrongly labeled as hasProperty
– HasProperty(literature, book)[] 1
• Some are facts, but not commonsense
– HasProperty(high point gibraltar, rock gibraltar 426 m)[] 1
RELATED WORK

Information extraction approaches:
      Rule-based       Pattern-based (iterative)   Joint inference   Pattern-based (non-iterative)
Pros  high precision   high recall                 high precision    high precision, high recall, no drift
Cons  low recall       low precision (drift)       scalability

Web-scale IE corpora:
      Small corpus, small domain   Larger corpus                     Search engine                     N-grams (easy access)
Pros  manual rules manageable      better precision (better stats)   high precision (reliable stats)   high precision, high recall, good runtime
Cons  low recall, low precision    low recall                        run time, top-K results

CSK acquisition:
      Human-supplied          Hard-coded rules             Search engine     (Re)use of knowledge
Pros  precise & rich          very precise                 simple, precise   pros of manual + search engine
Cons  expensive, low recall   very expensive, low recall   run time, top-K
SYNTHETIC TRAINING DATA GENERATION
• Build a seed-overlap matrix over the relations (atLocation, causes, hasProperty, isA, ...)
• Compute Jaccard similarity Sim(a, b) between the seed sets of relations a and b
• If Sim ~ 0, the relations are unrelated
• Combine seeds from unrelated relations to generate incorrect (negative) tuples
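The negative-example construction can be sketched as below (the similarity threshold and function names are illustrative):

```python
def jaccard(a, b):
    """Jaccard similarity between two relations' seed sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def negative_seeds(seeds_by_relation, threshold=0.05):
    """For each pair of relations whose seed sets barely overlap
    (Jaccard ~ 0), treat the seeds of one as negative tuples for the other."""
    negatives = {}
    rels = sorted(seeds_by_relation)
    for i, r1 in enumerate(rels):
        for r2 in rels[i + 1:]:
            if jaccard(seeds_by_relation[r1], seeds_by_relation[r2]) <= threshold:
                negatives[r1] = negatives.get(r1, set()) | seeds_by_relation[r2]
                negatives[r2] = negatives.get(r2, set()) | seeds_by_relation[r1]
    return negatives

seeds = {"atLocation": {("key", "pocket")}, "hasProperty": {("apple", "sweet")}}
print(negative_seeds(seeds))
```

Since the two toy relations share no seeds, each contributes its seeds as negatives for the other.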
ALL RESULTS