SOFIE: A Self-Organizing Framework for Information Extraction
Fabian M. Suchanek, Mauro Sozio, Gerhard Weikum (Max-Planck-Institute for Informatics, Saarbrücken, Germany)


Post on 13-Dec-2015



Ontologies

[Figure: an ontology graph with the nodes Entity, Singer, Country, and USA, connected by subclassOf, type, and bornInPlace edges. Wikipedia (e.g. the infobox entry "birth-place: USA") feeds DBpedia, YAGO, KYLIN, ...; for the Internet at large (e.g. the sentence "Elvis died in England") a question mark remains.]

Information Extraction

Goal:

Extract ontological information from natural language documents

"Elvis died in England" → diedInPlace(Elvis, England)

Previous approaches:

Espresso, DIPRE, LEILA, Snowball, TextRunner, Alice, and many more

- May deliver non-canonic relations: died in, perished in, was killed in
- May deliver non-canonic entities: England, UK, Great Britain
- May deliver inconsistent facts: diedInPlace(Elvis, England), diedInPlace(Elvis, Germany)

recoverWithout(most_people, medication)
areUnder(0%, the_age_of_18)
support(these_findings, the_notion)

SOFIE aims to solve these problems in a new unified framework

Pitfalls of Information Extraction

Web page:
Elvis died in England.
Louis XIV died in France.

Ontology (figure): a diedInPlace edge pointing to France

If a pattern occurs with two entities that stand in a relation, then the pattern maps to the relation.

"died in" = diedInPlace

If a meaningful pattern occurs with two entities, then the entities stand in the relation.

"Elvis" diedInPlace "England"

Taxidophobist?

- Reasoning Problem (figure label: Taxidophobist)
- Disambiguation Problem (figure labels: "Elvis", "England")
- Pattern Matching Problem: "died in" = diedInPlace ?

Information Extraction as Formulas

Reasoning Problem (Taxidophobist):

type(Elvis,Taxidophobist).

type(X,Taxidophobist) & bornInPlace(X,Y) => diedInPlace(X,Z) [0.8]


Information Extraction as Formulas

Disambiguation Problem

Assumptions:

- Within one document, the same word always has the same meaning
- The ontology already knows all important meanings of proper names

possibleMeaning(Elvis@D15, ElvisPresley). [0.7]

- Elvis@D15: a word in context (wic). Here: the word "Elvis" in document D15
- ElvisPresley: one possible meaning of "Elvis", as given by the ontology
- [0.7]: prior estimate of the likelihood of this meaning, computed as

$$\frac{|\,\mathrm{words}(D15) \cap \mathrm{rel}(ElvisPresley)\,|}{|\,\mathrm{words}(D15)\,|}$$

possibleMeaning(X,Y) => means(X,Y)

means(X,Y) & Y≠Z => ¬means(X,Z)
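The prior on the slide above is a simple overlap ratio. A minimal sketch of it follows, assuming that words(D) is the set of distinct words in the document and that rel(e) is a set of words the ontology associates with the entity; the function name and the toy data are hypothetical and not taken from the talk.

```python
def disambiguation_prior(document_words, related_words):
    """Share of the document's distinct words that also occur in rel(entity)."""
    doc = set(document_words)
    if not doc:
        return 0.0
    return len(doc & set(related_words)) / len(doc)


# Hypothetical toy data: a one-sentence "document" and an invented rel() set.
d15 = "elvis died in england after a concert tour".split()
rel_elvis_presley = {"elvis", "concert", "tour", "memphis", "singer"}

prior = disambiguation_prior(d15, rel_elvis_presley)
print(prior)
```

Whether words(D15) is meant as a set or as a bag of tokens is not stated on the slide; the set reading is used here only for simplicity.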


Information Extraction as Formulas

Pattern Matching Problem

Elvis died in England. Louis XIV died in France.

"died in" = diedInPlace ?

occurs("died in", Elvis@D15, England@D15). [14]

occurs(P,Wic1,Wic2) & means(Wic1,X) & means(Wic2,Y) & mapsTo(P,R) => R(X,Y)

occurs(P,Wic1,Wic2) & means(Wic1,X) & means(Wic2,Y) & R(X,Y) => mapsTo(P,R)
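The two pattern rules above can be read as generators of candidate hypotheses. The sketch below applies them crisply to toy data, first learning pattern mappings from a known fact and then proposing new facts. It ignores weights entirely, and all names (`Occurrence`, `pattern_hypotheses`, `fact_hypotheses`) and the data are assumptions made for illustration, not part of SOFIE.

```python
from collections import namedtuple

Occurrence = namedtuple("Occurrence", "pattern wic1 wic2")


def pattern_hypotheses(occurrences, means, known_facts):
    """occurs(P,W1,W2) & means(W1,X) & means(W2,Y) & R(X,Y) => mapsTo(P,R)."""
    mappings = set()
    for occ in occurrences:
        for x in means.get(occ.wic1, ()):
            for y in means.get(occ.wic2, ()):
                for relation, fx, fy in known_facts:
                    if (fx, fy) == (x, y):
                        mappings.add((occ.pattern, relation))
    return mappings


def fact_hypotheses(occurrences, means, mappings):
    """occurs(P,W1,W2) & means(W1,X) & means(W2,Y) & mapsTo(P,R) => R(X,Y)."""
    facts = set()
    for occ in occurrences:
        for x in means.get(occ.wic1, ()):
            for y in means.get(occ.wic2, ()):
                for pattern, relation in mappings:
                    if pattern == occ.pattern:
                        facts.add((relation, x, y))
    return facts


# Hypothetical toy input mirroring the two sentences on the slide.
occurrences = [
    Occurrence("died in", "LouisXIV@D15", "France@D15"),
    Occurrence("died in", "Elvis@D15", "England@D15"),
]
means = {
    "LouisXIV@D15": ["LouisXIV"], "France@D15": ["France"],
    "Elvis@D15": ["ElvisPresley"], "England@D15": ["England"],
}
known_facts = {("diedInPlace", "LouisXIV", "France")}

mappings = pattern_hypotheses(occurrences, means, known_facts)
facts = fact_hypotheses(occurrences, means, mappings)
print(mappings)
print(facts)
```

In the framework described in the talk these conclusions are hypotheses whose truth is decided jointly with the other formulas, so the crisp derivation here only shows which candidates the rules put on the table.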

Information Extraction as Formulas

Reasoning Problem:

type(X,Taxidophobist) & bornInPlace(X,Y) => diedInPlace(X,Z)

type(Elvis,Taxidophobist).

Disambiguation Problem:

possibleMeaning(Elvis@D15, ElvisPresley). [0.7]

Pattern Matching Problem:

occurs("died in", Elvis@D15, England@D15). [14]

Find truth assignments to hypotheses so that the weight of satisfied formulas is maximized

means(Elvis@D15, ElvisPresley) ?

mapsTo("died in", diedInPlace) ?

diedInPlace(ElvisPresley, England) ?

Weighted MAX SAT Problem

Find truth assignments to hypotheses so that the weight of satisfied formulas is maximized

Problems:

- The Weighted MAX SAT Problem is NP-hard
- Our instance of the problem is huge
- The most popular greedy approximation algorithm (Johnson's) does not work well with our type of formulas

Johnson's algorithm has an upper bound of 2/3 on its approximation ratio.

bornInPlace(X,Y) => ¬bornInPlace(X,Z)

¬A v ¬B, ¬A v ¬C, ¬B v ¬C

Structurally much simpler than MLNs. No need to model probabilities if we're just interested in the maximum.
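To make the objective concrete, here is a minimal brute-force sketch of Weighted MAX SAT on a tiny instance. It assumes that A, B, and C are three competing values of a functional relation, so the pairwise clauses are mutual exclusions (¬A v ¬B and so on); the weights are invented. Exhaustive search is exponential and serves only to illustrate what is being maximized, not how SOFIE solves it.

```python
from itertools import product


def satisfied_weight(clauses, assignment):
    """Sum the weights of all clauses with at least one true literal."""
    total = 0.0
    for literals, weight in clauses:
        if any(assignment[var] == positive for var, positive in literals):
            total += weight
    return total


def brute_force_max_sat(clauses):
    """Try every truth assignment and keep the one with the largest weight."""
    variables = sorted({var for literals, _ in clauses for var, _ in literals})
    best_assignment, best_weight = None, float("-inf")
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        weight = satisfied_weight(clauses, assignment)
        if weight > best_weight:
            best_assignment, best_weight = assignment, weight
    return best_assignment, best_weight


# A literal is (variable, polarity); polarity False means the negated variable.
clauses = [
    ([("A", True)], 3), ([("B", True)], 2), ([("C", True)], 1),  # evidence
    ([("A", False), ("B", False)], 10),                          # not both A and B
    ([("A", False), ("C", False)], 10),                          # not both A and C
    ([("B", False), ("C", False)], 10),                          # not both B and C
]

best_assignment, best_weight = brute_force_max_sat(clauses)
print(best_assignment, best_weight)
```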

FMS Algorithm

Formulas:

A v B [w1]
A v B [w2]
B v C [w3]
C [w4]

Hypotheses:

A = true
B = false
C = false

The Functional MAX SAT Algorithm considers only unit clauses.

The Functional MAX SAT Algorithm propagates Dominating Unit Clauses:

¬A v B [10]
¬A [10]
A [30]

A = true, since 30 > 10+10
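The propagation step can be sketched as follows. The sketch assumes that a unit clause counts as dominating when its weight exceeds the total weight of all clauses containing the opposite literal, which is how the slide's example (30 > 10+10) reads if the two weight-10 clauses contain ¬A. The function names are hypothetical and this is not the authors' implementation.

```python
def dominating_unit_clauses(clauses):
    """Return the literals of unit clauses that outweigh all opposing clauses."""
    result = []
    for literals, weight in clauses:
        if len(literals) != 1:
            continue
        var, positive = literals[0]
        opposing = sum(w for lits, w in clauses if (var, not positive) in lits)
        if weight > opposing:
            result.append((var, positive))
    return result


def propagate(clauses, var, value):
    """Fix var=value: collect satisfied clauses, strip the falsified literal."""
    remaining, gained = [], 0
    for literals, weight in clauses:
        if (var, value) in literals:
            gained += weight                      # clause is now satisfied
        else:
            reduced = [lit for lit in literals if lit[0] != var]
            if reduced:                           # empty clause: weight is lost
                remaining.append((reduced, weight))
    return remaining, gained


# A literal is (variable, polarity); polarity False means the negated variable.
clauses = [
    ([("A", False), ("B", True)], 10),   # ¬A v B [10]
    ([("A", False)], 10),                # ¬A     [10]
    ([("A", True)], 30),                 # A      [30]
]

dominating = dominating_unit_clauses(clauses)
remaining, gained = propagate(clauses, "A", True)
print(dominating, remaining, gained)
```

After fixing A = true, only the unit clause B [10] is left, which can then be examined in the same way.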

FMS Algorithm

FOR i=1 TO 42...NEXT i

- Approximation guarantee
- Polynomial time
- Experiments show better performance in practice than Johnson's algorithm in our setting.

FMS Algorithm

[Figure: the FMS Algorithm (FOR i=1 TO 42...NEXT i) receives the text "Elvis died in England" and rules of the form r(X,Y) & s(Y) => t(X,Y), and outputs a truth assignment:]

type(Elvis,Taxidophobist)=1
diedIn(Elvis,England)=0
means(Elvis@D15,Elvis)=0
means(Elvis@D15,...)=1

SOFIE

[Figure: the resulting fact graph with the labels St. Elvis, diedIn, and England, next to the rules r(X,Y) & s(Y) => t(X,Y).]

Other Experiments

| Corpus | Type | # Docs | Relations | Time | Precision |
|---|---|---|---|---|---|
| Wikipedia toy corpus | structured | 100 | 3 | 2min | 100% |
| Wikipedia subcorpus | semi-structured | 2000 | 15 | 15h | 94% |
| News article toy corpus | unstructured | 150 | 1 | 24min | 91% |
| Biographies from Web | unstructured | 3440 | 5 | 15h | 90% |

(All experiments with the YAGO ontology)

Conclusion

SOFIE unifies the tasks of

- entity disambiguation
- pattern extraction
- semantic constraint reasoning

in a single framework, delivering

- canonicalized facts
- of high precision

http://mpii.de/yago-naga

[Figure labels: "died in England... but is alive!", s(Y) => t(X)]

SOFIE rules!

occurs(P,WX,WY) /\ refersTo(WX,X) /\ refersTo(WY,Y) /\ R(X,Y) => expresses(P,R)

occurs(P,WX,WY) /\ expresses(P,R) /\ refersTo(WX,X) /\ refersTo(WY,Y) /\ domain(R,D1) /\ range(R,D2) /\ type(X,D1) /\ type(Y,D2) => R(X,Y)

R(X,Y) /\ R(X,Z) /\ type(R,function) => Y = Z

disambiguationPrior(W,X) => refersTo(W,X)

bornInYear(X,B) /\ diedInYear(X,D) => B<D
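Two of the rules above are semantic constraints: a functional relation admits only one value per subject, and a birth year must precede the death year. The following sketch checks both on toy facts. It treats them as hard checks for clarity, whereas the talk frames all rules as weighted formulas; the function names and the entity "SomePerson" are hypothetical.

```python
def functional_violations(facts, functional_relations):
    """R(X,Y) & R(X,Z) & type(R,function) => Y = Z, reported as conflicts."""
    first_value = {}
    violations = []
    for relation, x, y in facts:
        if relation not in functional_relations:
            continue
        key = (relation, x)
        if key in first_value and first_value[key] != y:
            violations.append((relation, x, first_value[key], y))
        else:
            first_value.setdefault(key, y)
    return violations


def lifespan_violations(born_in_year, died_in_year):
    """bornInYear(X,B) & diedInYear(X,D) => B < D, reported as offenders."""
    return [
        x for x, born in born_in_year.items()
        if x in died_in_year and not born < died_in_year[x]
    ]


facts = [
    ("diedInPlace", "Elvis", "England"),
    ("diedInPlace", "Elvis", "Germany"),
    ("bornInPlace", "Elvis", "USA"),
]
conflicts = functional_violations(facts, {"diedInPlace", "bornInPlace"})

# "SomePerson" is an invented entity with deliberately inconsistent years.
born = {"ElvisPresley": 1935, "SomePerson": 1900}
died = {"ElvisPresley": 1977, "SomePerson": 1890}
offenders = lifespan_violations(born, died)
print(conflicts, offenders)
```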

SOFIE: Experiments

| Corpus | Type | # Docs | Relations | Time | Precision | Recall |
|---|---|---|---|---|---|---|
| Wikipedia toy corpus | structured | 100 | 3 | 8min | 100% | 80% |
| Wikipedia toy corpus | semi-structured (50% infoboxes removed) | 100 | 3 | 8min | 100% | 57% |
| Wikipedia subcorpus | semi-structured | 2000 | 15 | 15h | 94% | ? |
| News article toy corpus | unstructured | 150 | 1 | 24min | 91% | 24%, 31% |
| Snowball | | | | | 56% | 31% |
| Biographies from Web | unstructured | 3440 | 5 | 15h | 90% | ? |


SOFIE: Large-Scale Experiment

Goal:

Extract bornIn, bornOnDate, diedIn, diedOnDate, politicianOf

Corpus:

3700 biography documents downloaded from the Web

Runtime: (summed over 5 batches)

Parsing 7:05h

Hypothesis Generation 6:15h

Solving 2:30h

Total 15:50h

Results: (precision in %)

bornIn bornOnD diedIn diedOnD polOf

87 87 13 98 95 90

SOFIE: Relation to Markov Logic

[Figure: a probability P over possible worlds, e.g. bornIn(Nicholas, Patras) being false or true.]

r(x,y) /\ s(x,z) => t(x,z) [w]

...

$$P(X) \sim e^{\sum_i \mathrm{sat}(i,X)\, w_i}$$

where sat(i,X) is the number of satisfied instances of the ith formula and w_i is the weight of the ith formula.

$$\max_X \; e^{\sum_i \mathrm{sat}(i,X)\, w_i}$$

$$\max_X \; \log\left( e^{\sum_i \mathrm{sat}(i,X)\, w_i} \right)$$

$$\max_X \; \sum_i \mathrm{sat}(i,X)\, w_i$$

~~~~> Weighted MAX SAT problem
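The reduction rests on the exponential being monotone: the world that maximizes the exponentiated weight sum is the same world that maximizes the plain weight sum. A small numeric sketch of that step, with invented weights and satisfaction counts for three hypothetical worlds:

```python
import math

weights = [1.5, 0.5, 2.0]        # w_i, hypothetical formula weights
sat_counts = {                   # sat(i, X) for three hypothetical worlds X
    "world_1": [1, 0, 1],
    "world_2": [2, 0, 0],
    "world_3": [0, 2, 1],
}


def weighted_sum(counts):
    """Sum over i of sat(i, X) * w_i for one world."""
    return sum(n * w for n, w in zip(counts, weights))


# Ranking by the unnormalized probability and by the plain sum agree.
by_probability = max(sat_counts, key=lambda x: math.exp(weighted_sum(sat_counts[x])))
by_weight = max(sat_counts, key=lambda x: weighted_sum(sat_counts[x]))
print(by_probability, by_weight)
```

Because only the maximizing world is needed, the normalization constant of the distribution never has to be computed.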

Grounding

r(X,Y) & s(Y) => t(X,Y)

{ ¬r(X,Y), ¬s(Y), t(X,Y) }

Entities={a,b}

{ ¬r(a,a), ¬s(a), t(a,a) }
{ ¬r(a,b), ¬s(b), t(a,b) }
{ ¬r(b,a), ¬s(a), t(b,a) }
{ ¬r(b,b), ¬s(b), t(b,b) }

Immutable, complete facts (e.g. pattern occurrences): r(a,a), r(a,b), r(b,a), r(b,b)

With the fact r(a,a) [w] given:

{ ¬s(a), t(a,a) } [w]

Resulting weighted clauses:

{ ¬s(a), t(a,a) } [w1]
{p(c,d), q(e), } [w2]

Find truth assignments to hypotheses so that the weight of satisfied formulas is maximized

means(Elvis@D15, ElvisPresley) = true ?

mapsTo("died in", diedInPlace) = true ?

diedInPlace(ElvisPresley, England) = true ?
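The grounding step for the rule r(X,Y) & s(Y) => t(X,Y) can be sketched as below. The sketch assumes that r is an immutable, complete predicate, so any r(x,y) not listed as a fact is false, and that the rule is read as the clause ¬r(X,Y) v ¬s(Y) v t(X,Y). The function name and the atom encoding are hypothetical.

```python
from itertools import product


def ground_rule(entities, known_r, weight):
    """Ground r(X,Y) & s(Y) => t(X,Y) over all entity pairs.

    A literal is (atom, polarity); polarity False means the negated atom.
    """
    clauses = []
    for x, y in product(entities, repeat=2):
        if (x, y) not in known_r:
            # r(x,y) is false, so ¬r(x,y) already satisfies this instance.
            continue
        # r(x,y) is true, so the literal ¬r(x,y) is false and drops out.
        clause = [(("s", y), False), (("t", x, y), True)]
        clauses.append((clause, weight))
    return clauses


# Only r(a,a) is taken to be a known fact, as in the simplified slide.
clauses = ground_rule(["a", "b"], known_r={("a", "a")}, weight=1.0)
print(clauses)
```

Of the four possible instances only one weighted clause remains, over the hypotheses s(a) and t(a,a), which is what keeps the resulting MAX SAT instance smaller than a naive grounding.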