Deriving a Web-Scale Commonsense Fact Database
Niket Tandon Gerard de Melo Gerhard Weikum Max Planck Institute for Informatics
Aug 11, 2011
SOME TRIVIAL FACTS
Apples are green, red, juicy, sweet, but not fast or funny
Parks and meadows are green or lively, but not black or slow
Keys are kept in a pocket, but not in the air
Question: How do computers know this? Solution: Build a commonsense knowledge base
INTRODUCTION
What is the problem? Harvest commonsense facts from text:
– Flower is soft => hasProperty(flower, soft)
– Room is part of house => partOf(room, house)
Why is it hard? Commonsense facts are rarely mentioned explicitly in text, and natural language text is noisy.
What is required to tackle the problem? A Web-scale corpus. But a Web-scale corpus is hard to get! Using Web-scale N-grams instead poses interesting research challenges.
MESSAGE OF THE TALK
N-grams simulate a larger corpus
Existing information extraction models must be carefully adapted for harvesting facts
AGENDA
1. Introduction
2. Pattern-based information extraction model
3. Web N-grams
4. Pattern ranking
5. Extraction and ranking of facts
GOOD SEEDS => (GOOD) PATTERNS
Seed pairs (fire, hot), (ice, cold), and (flower, beautiful) all instantiate the pattern "X is very Y"
Text: "He bought very sweet apples" / "Apples and sweet potato are delicious" / "He kept the keys in pocket"
Seed facts: hasProperty: <apple, sweet>; hasLocation: <key, pocket>
Candidate patterns:
[hasProperty] "He bought very Y X", "X and Y potato are delicious"
[hasLocation] "He kept the X in Y"
Note that "X and Y potato are delicious" arises from a coincidental match: not every candidate pattern is a good one.
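The seed-to-pattern step can be sketched as follows (a minimal illustration with hypothetical helper names, not the authors' exact implementation): given a seed pair and an n-gram containing both seed words, replace those words with X/Y placeholders.

```python
def induce_pattern(ngram, seed):
    """Turn an n-gram into a candidate pattern by replacing the seed's
    argument words with X and Y placeholders."""
    x, y = seed
    tokens = ngram.split()
    if x not in tokens or y not in tokens:
        return None  # the n-gram does not mention both seed arguments
    return " ".join("X" if t == x else "Y" if t == y else t for t in tokens)

# Seed hasProperty: <apple, sweet> against a matching 5-gram:
print(induce_pattern("He bought very sweet apples", ("apples", "sweet")))
# -> He bought very Y X
```

The same call on "He kept the keys in pocket" with seed <keys, pocket> yields "He kept the X in Y", the hasLocation candidate from the slide.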
GOOD PATTERNS => (GOOD) TUPLES
N-gram: "He kept the butter in refrigerator"
Pattern: [hasLocation] "He kept the X in Y"
Extracted tuple: [hasLocation] <butter, refrigerator>
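The reverse step, applying a pattern to new n-grams to extract candidate tuples, can be sketched like this (function name is mine; single-word arguments assumed for simplicity):

```python
import re

def apply_pattern(pattern, ngram):
    """Match an X/Y pattern against an n-gram and return the extracted
    (X, Y) argument pair, or None if the pattern does not match."""
    parts = [r"(?P<x>\w+)" if tok == "X"
             else r"(?P<y>\w+)" if tok == "Y"
             else re.escape(tok)
             for tok in pattern.split()]
    m = re.fullmatch(r"\s+".join(parts), ngram)
    return (m.group("x"), m.group("y")) if m else None

print(apply_pattern("He kept the X in Y", "He kept the butter in refrigerator"))
# -> ('butter', 'refrigerator')
```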
MODEL
Seeds => Pattern Induction => Pattern Ranking => Fact Extraction and Ranking
Induced patterns: [hasProperty] "He bought very Y X", "X and Y potato are delicious"; [hasLocation] "He kept the X in Y"
STATE OF THE ART - PATTERN-BASED IE
DIPRE - Brin '98, Snowball - Agichtein et al. '00, KnowItAll - Etzioni et al. '04
Observations: low recall on easily available corpora (a large corpus is difficult to get); low precision when applied to our corpus
The corpus we use to extract facts is Web-scale N-grams
WEB-SCALE N-GRAMS
N-gram: a sequence of N consecutive word tokens, e.g. "the apples are very red"
Web-scale N-gram statistics are derived from a trillion words of text, with a frequency for each n-gram, e.g. "the apples are very red" 12000
Datasets: Google N-grams, Microsoft N-grams, Yahoo N-grams
N-gram dataset limitation: length <= 5, so longer contexts cannot be captured, e.g. "the apple that he was eating was very red"
But most commonsense relations fit this small context, and the sheer volume of data is an advantage
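Such datasets are typically distributed as tab-separated "n-gram, frequency" records; a minimal reader might look like this (the record format is an assumption, and the second sample count is invented for illustration):

```python
def read_ngrams(lines):
    """Parse 'token sequence<TAB>frequency' records, as distributed in
    Web-scale n-gram datasets (format assumed here)."""
    for line in lines:
        ngram, _, freq = line.rstrip("\n").rpartition("\t")
        yield ngram, int(freq)

sample = ["the apples are very red\t12000", "kept the keys in pocket\t530"]
for ngram, freq in read_ngrams(sample):
    print(ngram, freq)
```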
EXAMPLE OF COMMONSENSE RELATIONS
OUR APPROACH
Use ConceptNet data as seeds to harvest commonsense facts from the Google N-grams corpus
ConceptNet: MIT's commonsense knowledge base, constructed by crowd-sourcing and further processed
We take a very large number of seeds, which avoids drift across iterations
We consider variations of seeds for nouns (plural forms): [key, pocket], [keys, pocket], [keys, pockets]
This gives a very large number of potential patterns, but most are noise
We constrain patterns by part-of-speech tags: X<noun> is very Y<adjective>
We need to carefully rank the potential patterns
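The plural-variant expansion of seeds can be sketched as below (naive "+s" pluralization only; the actual system presumably handles morphology more carefully):

```python
def seed_variations(x, y):
    """Expand a seed pair with naive plural forms of its noun arguments."""
    return sorted({(a, b) for a in (x, x + "s") for b in (y, y + "s")})

print(seed_variations("key", "pocket"))
# -> [('key', 'pocket'), ('key', 'pockets'), ('keys', 'pocket'), ('keys', 'pockets')]
```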
One dirty fish spoils the whole pond!
EXISTING PATTERN RANKING APPROACH: PMI
PMI score for a pattern p with matching seeds (x, y)
Problems with PMI here:
- Uses raw frequencies, not distinct seeds: biased towards rare events (strings containing the seed words by chance)
- Frequencies alone are not enough (spam, boilerplate text)
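The PMI formula itself did not survive transcription. In Espresso-style pattern ranking, which this slide appears to follow, the score of a pattern $p$ over its matching seed pairs $(x, y)$ is commonly defined as

$$\mathrm{pmi}(p) = \sum_{(x,y)} \log \frac{|x, p, y|}{|x, *, y|\,|*, p, *|}$$

where $|x, p, y|$ counts occurrences of the pattern instantiated with the seed pair and the starred terms are the marginal counts; the exact variant used here may differ.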
PATTERN RANKING – OBSERVATION 1
Observation 1: The number of seeds a pattern matches follows a power law; unreliable patterns likely lie in the tail.
Question 1: Can we find the patterns that are not in the tail?
The score based on Observation 1 employs the gradient of the power-law curve s(x) ~ a·x^k (rather than a fixed threshold).
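One way to read the gradient-based cutoff (an illustrative sketch, not the paper's exact procedure): walk down the descending seed-match counts and stop at the rank where the curve flattens, i.e. where the gradient drops below a threshold.

```python
def tail_cutoff(counts, min_gradient=2):
    """Given per-pattern seed-match counts sorted in descending order,
    return the rank at which the power-law curve flattens; patterns past
    this rank fall into the unreliable tail."""
    for i in range(1, len(counts)):
        if counts[i - 1] - counts[i] < min_gradient:
            return i
    return len(counts)

counts = [900, 400, 150, 60, 20, 5, 4, 4, 3, 3]
print(tail_cutoff(counts))  # -> 6 : only the first six patterns are kept
```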
PATTERN RANKING – OBSERVATION 2
Observation 2: Some patterns match many seeds but match all sorts of things, e.g. "<X> and <Y>" matches seeds of many different relations; PMI does not consider the number of relations matched.
Question 2: Can we penalize such patterns?
The score based on Observation 2 penalizes a pattern by the number of distinct relations it matches.
PATTERN RANKING – OUR APPROACH
Combined pattern score: the scores based on Observations 1 and 2 are combined using a logistic function
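The logistic combination can be sketched as follows (the weights and bias are illustrative placeholders, not the values used in the paper):

```python
import math

def combined_score(s1, s2, w1=1.0, w2=1.0, bias=0.0):
    """Squash a weighted sum of the two observation-based scores
    into (0, 1) with the logistic function."""
    return 1.0 / (1.0 + math.exp(-(w1 * s1 + w2 * s2 + bias)))

print(round(combined_score(2.0, -1.0), 3))  # -> 0.731
```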
IMPROVEMENT OVER PMI IN PATTERN RANKING (ISA RELATION)
Top-Ranked Patterns (PMI)   Top-Ranked Patterns (q)
Varsity <Y> <X>             Men <Y>/<X>
<Y> MLB <X>                 <Y> : <X> </S>
<Y> <X> Boys                <Y> <X> </S>
<Y> Posters <X>             Basketball <Y> - <X> </S>
<Y> - <X>                   Basketball <Y> such as <X>
<Y> MLB <X> NBA             <S> <X> <Y>
<Y> Badminton <X>           <X> and other <Y>
Example: "San Francisco and other cities" instantiates "<X> and other <Y>"
ESTIMATE FACT CONFIDENCE: SIMPLE APPROACH
Each tuple is represented by [pattern id : frequency] entries:
recipes yummy [16:130, 19:51, 21:55, 98:219, 10:80, 63:180, 29:51, 121:57] - matches several patterns
title unique [3:111, 2:63, 114:91, 1:213, 0:788, 41:246, 55:95, 22:112, 18:75, 9:48, 60:64, 14:71] - matches several patterns
apples nutritious [12:144] - matches few patterns
applet unable [11:62] - matches few patterns
Good tuples match many patterns; counting matched patterns alone gives low recall but high precision
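The pattern-id:frequency vectors above can be assembled from raw matches like this (a sketch; variable names are mine):

```python
from collections import defaultdict

def pattern_count_vectors(matches):
    """Aggregate (tuple, pattern_id, frequency) matches into the per-tuple
    {pattern_id: frequency} feature maps shown above."""
    vectors = defaultdict(dict)
    for tup, pattern_id, freq in matches:
        vectors[tup][pattern_id] = vectors[tup].get(pattern_id, 0) + freq
    return dict(vectors)

matches = [(("recipes", "yummy"), 16, 130), (("recipes", "yummy"), 19, 51),
           (("apples", "nutritious"), 12, 144)]
vecs = pattern_count_vectors(matches)
print(len(vecs[("recipes", "yummy")]))  # matched by 2 distinct patterns
```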
OUR FACT RANKING APPROACH
The pattern-count feature vectors are used to learn a decision tree, which yields facts with estimated confidence.
          P1        P2        P3        ...  label
Tuple 1   f(T1,P1)  f(T1,P2)  f(T1,P3)  ...  Positive
Tuple 2   f(T2,P1)  f(T2,P2)  f(T2,P3)  ...  Negative
Tuple 3   f(T3,P1)  f(T3,P2)  f(T3,P3)  ...  Positive
RECAP: MODEL
Construct seeds from ConceptNet
Pattern induction over Google 5-grams
Pattern ranking (match many seeds, but not too many)
Fact extraction with clean patterns over Google 5-grams
EXPERIMENTAL SETUP
Test data (true and false labels): randomly chosen high-confidence facts from ConceptNet
Precision and recall computed using 10-fold cross-validation over the test data
Classifier used: decision trees with adaptive boosting
RESULTS – MORE THAN 200 MILLION FACTS EXTRACTED
Relation         Precision (%)   #Facts Extracted   Relative Recall (%)
CapableOf        77                 907,173         45
Causes           88               3,218,388         49
Desires          58               4,386,685         69
HasPrerequisite  82               5,336,630         65
HasProperty      62               2,976,028         48
IsA              62              11,694,235         27
LocatedNear      71              13,930,656         61
PartOf           71              11,175,349         58
SimilarSize      74               8,640,737         49
.. many others   ...             ...                ...
An extension of ConceptNet by orders of magnitude
FURTHER DIRECTIONS
Tune the system towards higher precision to release a high-quality knowledge base
Applications enabled by a commonsense knowledge base
TAKE HOME MESSAGE
N-grams simulate a larger corpus; N-grams embed patterns and frequency
Novel pattern ranking adapted for the N-gram corpus; PMI is not the best choice in our case
The extracted fact matrix extends ConceptNet by more than 200x!
ADDITIONAL SLIDES FOLLOW
INACCURACIES IN CONCEPTNET
• Properties are wrongly compounded
– HasProperty(apple, green yellow red)[usually] 1
• Scores of zero are given to correct tuples
– HasProperty(wallet, black)[] 0
• Negated entries are in fact commonsense facts
– HasProperty(jeans, blue)[not] 1
• Polarity markers are confusing for machine consumption
– HasProperty(jeans, blue)[not] 1
– HasProperty(jeans, blue)[often] 1
– HasProperty(jeans, blue)[usually] 1
• Some entries are wrongly labeled as hasProperty
– HasProperty(literature, book)[] 1
• Some are facts, but not commonsense
– HasProperty(high point gibraltar, rock gibraltar 426 m)[] 1
RELATED WORK

Information extraction approaches:
      Rule-based       Pattern-based (iterative)   Joint inference   Pattern-based (non-iterative)
Pros  high precision   high recall                 high precision    high precision, high recall, no drift
Cons  low recall       low precision (drift)       scalability

Web-scale IE corpora:
      Small corpus, small domain   Larger corpus                     Search engine                     N-grams (easy access)
Pros  manual rules manageable      better precision (better stats)   high precision (reliable stats)   high precision, high recall, good runtime
Cons  low recall, low precision    low recall                        run time, top-K results

CSK acquisition:
      Human-supplied          Hard-coded rules             Search engine     (Re)use of knowledge
Pros  precise & rich          very precise                 simple, precise   pros of manual + search engine
Cons  expensive, low recall   very expensive, low recall   run time, top-K
SYNTHETIC TRAINING DATA GENERATION
• Build a seed-overlap matrix over the relations (atLocation, causes, hasProperty, isA, ...)
• Compute Jaccard similarity Sim(a, b) between the seed sets of relations a and b
• If Sim ~ 0, the relations are unrelated
• Combine seeds from unrelated relations to generate incorrect (negative) tuples
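The negative-example construction can be sketched as below (the similarity threshold and function names are illustrative):

```python
def jaccard(a, b):
    """Jaccard similarity between two relations' seed sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def negative_seeds(seeds_by_relation, threshold=0.05):
    """For each pair of relations whose seed sets barely overlap
    (Jaccard ~ 0), treat the seeds of one as negative tuples for the other."""
    negatives = {}
    rels = sorted(seeds_by_relation)
    for i, r1 in enumerate(rels):
        for r2 in rels[i + 1:]:
            if jaccard(seeds_by_relation[r1], seeds_by_relation[r2]) <= threshold:
                negatives[r1] = negatives.get(r1, set()) | seeds_by_relation[r2]
                negatives[r2] = negatives.get(r2, set()) | seeds_by_relation[r1]
    return negatives

seeds = {"atLocation": {("key", "pocket")}, "hasProperty": {("apple", "sweet")}}
print(negative_seeds(seeds))
```

Since the two toy relations share no seeds, each contributes its seeds as negatives for the other.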
ALL RESULTS