Extracting Multilingual Natural-Language Patterns for RDF Predicates

AKSW, Universität Leipzig. BOA: Extracting Multilingual Natural-Language Patterns for RDF Predicates. Daniel Gerber, Axel-Cyrille Ngonga Ngomo

Upload: daniel-gerber

Post on 07-Jul-2015

DESCRIPTION

Most knowledge sources on the Data Web were extracted from structured or semi-structured data. Thus, they cover only a small fraction of the information available on the document-oriented Web. In this paper, we present BOA, a bootstrapping strategy for extracting RDF from text. The idea behind BOA is to use background knowledge from the Data Web to extract, from unstructured data, natural-language patterns that represent predicates found on the Data Web. These patterns are then used to extract instance knowledge from natural-language text. This knowledge is finally fed back into the Data Web, thereby closing the loop. The approach followed by BOA is largely independent of the language in which the corpus is written. We demonstrate our approach by applying it to four different corpora and two different languages. We evaluate BOA on these data sets using DBpedia as background knowledge. Our results show that we can extract several thousand new facts in one iteration with very high accuracy.

TRANSCRIPT

Page 1: Extracting Multilingual Natural-Language Patterns for RDF Predicates

AKSW, Universität Leipzig

BOA: Extracting Multilingual Natural-Language Patterns for RDF Predicates

Daniel Gerber, Axel-Cyrille Ngonga Ngomo

Page 2: Extracting Multilingual Natural-Language Patterns for RDF Predicates

Bootstrapping the Data Web

EKAW - http://boa.aksw.org - 10.10.2012

Motivation

๏ Most knowledge bases are extracted from (semi-)structured data

๏ Only 15-20% of information is available as structured data

๏ Semantic Web ⬌ Document Web

๏ How can we extract data from the document-oriented web?

Page 3: Extracting Multilingual Natural-Language Patterns for RDF Predicates

Idea I

dbpedia:Barack_Obama dbpedia-owl:birthPlace dbpedia:Honolulu,_Hawaii

dbpedia:Barack_Obama dbpedia-owl:party dbpedia:Democratic_Party

dbpedia:Barack_Obama dbpedia-owl:spouse dbpedia:Michelle_Obama

Page 4: Extracting Multilingual Natural-Language Patterns for RDF Predicates

Idea II

Barack Obama was born in Honolulu, Hawaii.

Barack Hussein Obama is a politician of the Democratic Party.

Obama married Michelle Robinson in 1992.

is a politician of the

met

was born in

Page 5: Extracting Multilingual Natural-Language Patterns for RDF Predicates

Idea III

is a politician of the

married

was born in

Joseph Martin "Joschka" Fischer (born 1948-04-12) is a politician of the German Green Party.

Dietrich's only child, Maria Elisabeth Sieber, was born in Berlin on 13 December 1924.

Jackie Bouvier Kennedy Onassis who married John F. Kennedy was tied to the Auchinclosses via her sister's marriage into the Auchincloss family.

Page 6: Extracting Multilingual Natural-Language Patterns for RDF Predicates

The BOA approach

[Figure: BOA architecture. Components shown: Data Web, Web, Corpus Extraction Module (Crawler, Cleaner, Indexer), Corpora, Surface forms, Patterns, Neural Network. Labelled operations: SPARQL, Search & Filter, Filter, Feature Extraction, Generation. Steps are numbered 1 to 8.]

Page 7: Extracting Multilingual Natural-Language Patterns for RDF Predicates

Pattern Search

(1) Take the set of entity pairs (s, o) connected through predicate p

(2) Find all sentences that contain the labels of both s and o

(3) Replace the labels with variables (?D?, ?R?)

BOA pattern:

“?D? with his wife ?R?”

BOA pattern mapping:

dbpedia-owl:spouse

“?D? with his wife ?R?”

“?D? and his wife ?R?”

“?D? and her husband ?R?”
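The three pattern-search steps can be sketched in a few lines of Python. This is a minimal illustration, not the BOA implementation: the triples, the toy corpus, and the function name `find_patterns` are all assumptions, plain substring search stands in for the indexed corpus search, and ?D? / ?R? are assumed to mark the subject (domain) and object (range) label.

```python
from collections import defaultdict

# Hypothetical background knowledge: (subject label, predicate, object label).
TRIPLES = [
    ("Barack Obama", "dbpedia-owl:birthPlace", "Honolulu, Hawaii"),
    ("Barack Obama", "dbpedia-owl:spouse", "Michelle Obama"),
]

# Hypothetical toy corpus, one sentence per entry.
CORPUS = [
    "Barack Obama was born in Honolulu, Hawaii.",
    "Barack Obama and his wife Michelle Obama attended the ceremony.",
    "Michelle Obama and her husband Barack Obama live in Washington.",
]


def find_patterns(triples, corpus):
    """Return {predicate: set of patterns} found in the corpus."""
    mappings = defaultdict(set)
    for s_label, predicate, o_label in triples:      # step (1)
        for sentence in corpus:                      # step (2)
            s_pos = sentence.find(s_label)
            o_pos = sentence.find(o_label)
            if s_pos < 0 or o_pos < 0:
                continue  # sentence does not mention both entities
            # Step (3): keep the text between the labels, swap in variables.
            if s_pos < o_pos:
                between = sentence[s_pos + len(s_label):o_pos]
                pattern = f"?D?{between}?R?"
            else:
                between = sentence[o_pos + len(o_label):s_pos]
                pattern = f"?R?{between}?D?"
            mappings[predicate].add(pattern)
    return dict(mappings)


PATTERN_MAPPINGS = find_patterns(TRIPLES, CORPUS)
```

Grouping the patterns by predicate gives the pattern mapping shown on the slide: one predicate, many natural-language realisations.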

Page 8: Extracting Multilingual Natural-Language Patterns for RDF Predicates

Feature Extraction - Language Independent

Support: pattern should be used across several triples

subsidiary ↣ “?Company was acquired by ?Company”

๏ Google - DoubleClick: 2

๏ General Motors - Opel: 1

๏ Cablevision - Rainbow Media: 4

Specificity: pattern should not be used by many pattern mappings

๏ subsidiary: “?R? is a part of ?D?”

๏ foundationOrg: “?R? is a part of ?D?”

Typicity: pattern should be used to connect entities of correct type

๏ Hypercom_ORG was_O acquired_O by_O Verifone_ORG ._O
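The intuition behind the three language-independent features can be written down as simple counting functions. The sketch below uses simplified stand-ins, not the exact formulas from the BOA paper: the function names, the data layout, and the IDF-style form of specificity are assumptions made for illustration.

```python
import math

# Hypothetical observations: pattern -> {(subject, object): sentence count}.
OCCURRENCES = {
    "?Company was acquired by ?Company": {
        ("Google", "DoubleClick"): 2,
        ("General Motors", "Opel"): 1,
        ("Cablevision", "Rainbow Media"): 4,
    },
}

# Hypothetical pattern mappings: predicate -> set of patterns.
MAPPINGS = {
    "subsidiary": {"?R? is a part of ?D?", "?Company was acquired by ?Company"},
    "foundationOrg": {"?R? is a part of ?D?"},
}


def support(pattern, occurrences):
    """Return (number of distinct entity pairs, largest per-pair count)."""
    counts = occurrences.get(pattern, {})
    return len(counts), max(counts.values(), default=0)


def specificity(pattern, mappings):
    """IDF-style score: log(#mappings / #mappings that contain the pattern)."""
    containing = sum(1 for patterns in mappings.values() if pattern in patterns)
    if containing == 0:
        return 0.0
    return math.log(len(mappings) / containing)


def typicity(tagged_matches, domain_type, range_type):
    """Share of matches whose NER tags agree with the expected types."""
    if not tagged_matches:
        return 0.0
    expected = (domain_type, range_type)
    correct = sum(1 for tags in tagged_matches if tuple(tags) == expected)
    return correct / len(tagged_matches)
```

With this data, “?R? is a part of ?D?” scores a specificity of 0 because it occurs in every mapping, while the acquisition pattern is specific to `subsidiary`.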

Page 9: Extracting Multilingual Natural-Language Patterns for RDF Predicates

Feature Extraction - Language Dependent

Intrinsic Information Content Metric (IICM)

๏ dbpedia:subsidiary rdfs:label “subsidiary”@en

๏ WordNet-based similarity between the label and the pattern “?D? was acquired by ?R?”

ReVerb

๏ Open Information Extraction

๏ Patterns need to abide by a POS tag sequence

๏ Logistic Regression Classifier
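The POS tag sequence requirement can be illustrated with a regular expression over coarse tag classes. The sketch below is a simplified rendering of the syntactic constraint ReVerb is known for (a verb, optionally followed by other words and ending in a preposition or particle); the tag groupings and the function names are assumptions, and ReVerb's logistic regression confidence is not modelled here.

```python
import re

# Coarse classes over Penn Treebank tags (a simplifying assumption).
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
WORD_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "JJS",
             "RB", "RBR", "RBS", "PRP", "PRP$", "DT"}
PREP_TAGS = {"IN", "TO", "RP"}

# V | V P | V W* P, repeated one or more times.
REVERB_LIKE = re.compile(r"(?:V(?:W*P)?)+")


def pos_symbols(tags):
    """Map each POS tag to V, W, P, or X (anything else)."""
    symbols = []
    for tag in tags:
        if tag in VERB_TAGS:
            symbols.append("V")
        elif tag in WORD_TAGS:
            symbols.append("W")
        elif tag in PREP_TAGS:
            symbols.append("P")
        else:
            symbols.append("X")
    return "".join(symbols)


def abides_pos_sequence(tags):
    """True if the whole tag sequence matches the verb-centred constraint."""
    return REVERB_LIKE.fullmatch(pos_symbols(tags)) is not None
```

Under these assumptions, the tags of "was acquired by" (VBD VBN IN) pass the check, while those of "with his wife" (IN PRP$ NN) do not, since the phrase contains no verb.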

Page 10: Extracting Multilingual Natural-Language Patterns for RDF Predicates

BOA Neural Network

Input Layer [0,1] → Hidden Layer → Output Layer [0,1]

Input features (examples): ReVerb, Specificity, IICM, Typicity

๏ 200 patterns are manually classified as good (1) or bad (0)

๏ up to 18 features, depending on language
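A network of this shape (features scaled to [0,1], one hidden layer, a single output in [0,1]) can be sketched with the Python standard library alone. Everything specific here is an assumption for illustration: the class name `TinyMLP`, the hidden layer size, the learning rate, the cross-entropy training loop, and the four made-up feature vectors standing in for the 200 manually labelled patterns.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """One hidden layer, sigmoid activations, one output in (0, 1)."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # One extra weight per unit for the bias term.
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    def _forward(self, features):
        inputs = list(features) + [1.0]  # append bias input
        hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs)))
                  for row in self.w_hidden] + [1.0]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)))
        return inputs, hidden, out

    def score(self, features):
        """Pattern score: close to 1 means good, close to 0 means bad."""
        return self._forward(features)[2]

    def train(self, samples, epochs=2000, lr=0.5):
        """Plain stochastic gradient descent on the cross-entropy loss."""
        for _ in range(epochs):
            for features, target in samples:
                inputs, hidden, out = self._forward(features)
                delta_out = out - target
                # Backpropagate through the hidden units (bias unit excluded).
                delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                                for j, h in enumerate(hidden[:-1])]
                for j, h in enumerate(hidden):
                    self.w_out[j] -= lr * delta_out * h
                for row, delta in zip(self.w_hidden, delta_hidden):
                    for i, x in enumerate(inputs):
                        row[i] -= lr * delta * x


# Made-up feature vectors (ReVerb, Specificity, IICM, Typicity) with labels.
SAMPLES = [
    ([0.9, 0.8, 0.9, 0.8], 1),
    ([0.8, 0.9, 0.7, 0.9], 1),
    ([0.1, 0.2, 0.1, 0.3], 0),
    ([0.2, 0.1, 0.3, 0.1], 0),
]

net = TinyMLP(n_inputs=4, n_hidden=3)
net.train(SAMPLES)
```

After training, `net.score(features)` can be used to rank patterns, so that only the highest-scoring ones are kept for the extraction step.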

Page 11: Extracting Multilingual Natural-Language Patterns for RDF Predicates

RDF Generation

Sentence: Pacheco arrived with his wife Leyla Rodriguez Stahl and several...

NER tags: Pacheco_PER arrived_O with_O his_O wife_O Leyla_PER Rodriguez_PER Stahl_PER and_O

Pattern: ?D? with his wife ?R?

Generated RDF (the figure marks four elements as NEW):

dbpedia:Abel_Pacheco dbpedia-owl:spouse boa:Leyla_Rodriguez_Stahl

dbpedia:Abel_Pacheco rdfs:label “Abel Pacheco”@en

dbpedia:Abel_Pacheco rdf:type dbpedia-owl:Person

boa:Leyla_Rodriguez_Stahl rdfs:label “Leyla Rodriguez Stahl”@en

boa:Leyla_Rodriguez_Stahl rdf:type dbpedia-owl:Person
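The generation step for the Pacheco example can be sketched as follows. This is an illustration under stated assumptions, not BOA's code: the helper names, the `SURFACE_FORMS` lookup table, the rule of picking the nearest entity of the expected type on each side of the pattern text, and the minting of `boa:` URIs for unknown labels are all hypothetical.

```python
# NER-tagged tokens as (token, tag) pairs.
TAGGED = [("Pacheco", "PER"), ("arrived", "O"), ("with", "O"), ("his", "O"),
          ("wife", "O"), ("Leyla", "PER"), ("Rodriguez", "PER"),
          ("Stahl", "PER"), ("and", "O")]

# Hypothetical surface-form lookup: label -> known Data Web resource.
SURFACE_FORMS = {"Pacheco": "dbpedia:Abel_Pacheco"}


def entity_chunks(tagged):
    """Group consecutive tokens sharing a non-O tag: (start, end, text, tag)."""
    chunks, i = [], 0
    while i < len(tagged):
        tag = tagged[i][1]
        if tag == "O":
            i += 1
            continue
        j = i
        while j < len(tagged) and tagged[j][1] == tag:
            j += 1
        chunks.append((i, j, " ".join(tok for tok, _ in tagged[i:j]), tag))
        i = j
    return chunks


def to_uri(label):
    """Use a known resource if available, else mint a new boa: URI."""
    return SURFACE_FORMS.get(label, "boa:" + label.replace(" ", "_"))


def apply_pattern(pattern, predicate, tagged, expected_tag="PER"):
    """Return a (subject, predicate, object) triple, or None if no match."""
    words = pattern.split()
    first, middle = words[0], words[1:-1]
    tokens = [tok for tok, _ in tagged]
    chunks = [c for c in entity_chunks(tagged) if c[3] == expected_tag]
    for start in range(len(tokens) - len(middle) + 1):
        if tokens[start:start + len(middle)] != middle:
            continue
        end = start + len(middle)
        left = [c for c in chunks if c[1] <= start]
        right = [c for c in chunks if c[0] >= end]
        if not left or not right:
            continue
        before, after = left[-1][2], right[0][2]  # nearest entity per side
        if first == "?D?":
            return (to_uri(before), predicate, to_uri(after))
        return (to_uri(after), predicate, to_uri(before))
    return None


TRIPLE = apply_pattern("?D? with his wife ?R?", "dbpedia-owl:spouse", TAGGED)
```

Note that "arrived" sits between "Pacheco" and the pattern text, which is why the sketch looks for the nearest tagged entity instead of requiring the entity to be directly adjacent.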

Page 12: Extracting Multilingual Natural-Language Patterns for RDF Predicates

Evaluation I

Corpus                   | en-wiki           | en-news | de-wiki           | de-news
Language                 | English           | English | German            | German
Topic                    | general knowledge | news    | general knowledge | news
# of sentences           | 58M               | 214.2M  | 24.6M             | 112.8M
# of tokens per sentence | 21.4              | 22.1    | 17.4              | 18.3

Page 13: Extracting Multilingual Natural-Language Patterns for RDF Predicates

Evaluation II

Corpus                | en-wiki | en-news | de-wiki | de-news
# of pattern mappings | 125     | 44      | 66      | 19
# of patterns         | 9551    | 586     | 7366    | 109
# of new triples      | 78944   | 22883   | 10138   | 883
# of known triples    | 1829    | 798     | 655     | 42
# of found triples    | 80773   | 3081    | 10793   | 925
Precision Top-100     | 92 %    | 70 %    | 91 %    | 74 %

Page 14: Extracting Multilingual Natural-Language Patterns for RDF Predicates

Conclusion

๏ No manually created seed patterns needed

๏ > 90% top-100 precision for the German and English Wikipedia corpora

๏ High recall through surface forms

๏ Output easily integrable into the LOD Cloud

๏ Library of natural-language representations of formal relations, Demo

Page 15: Extracting Multilingual Natural-Language Patterns for RDF Predicates

BOA Graphical User Interface

http://boa.aksw.org

Page 16: Extracting Multilingual Natural-Language Patterns for RDF Predicates

Thank you! Questions?

Daniel Gerber

Augustusplatz 10, Room P616

04109 Leipzig, Germany

SIMBA@AKSW

http://bis.informatik.uni-leipzig.de/DanielGerber

http://boa.aksw.org

http://code.google.com/p/boa