![Page 1: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/1.jpg)
FROM UNSTRUCTURED INFORMATION TO LINKED DATA
Axel Ngonga
Head of SIMBA@AKSW
University of Leipzig
IASLOD, August 15/16th 2012
![Page 2: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/2.jpg)
Motivation
![Page 3: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/3.jpg)
Motivation
• Where does the LOD Cloud come from?
  • Structured data: Triplify, D2R
  • Semi-structured data: DBpedia
  • Unstructured data: ???
• Unstructured data make up 80% of the Web
• How do we extract Linked Data from unstructured data sources?
![Page 4: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/4.jpg)
Overview
1. Problem Definition
2. Named Entity Recognition
  • Algorithms
  • Ensemble Learning
3. Relation Extraction
  • General approaches
  • OpenIE approaches
4. Entity Disambiguation
  • URI Lookup
  • Disambiguation
5. Conclusion
NB: Will be mainly concerned with the newest developments.
![Page 5: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/5.jpg)
Overview
1. Problem Definition
2. Named Entity Recognition
  • Algorithms
  • Ensemble Learning
3. Relation Extraction
  • General approaches
  • OpenIE approaches
4. Entity Disambiguation
  • URI Lookup
  • Disambiguation
5. Conclusion
![Page 6: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/6.jpg)
Problem Definition
• Simple(?) problem: given a text fragment, automatically
  • retrieve all entities,
  • retrieve the relations between these entities, and
  • "ground them" in an ontology
• Also known as Knowledge Extraction
Example: John Petrucci was born in New York.
Entities: :John_Petrucci, :New_York
Relation: dbo:birthPlace
Result: :John_Petrucci dbo:birthPlace :New_York .
![Page 7: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/7.jpg)
Problems
1. Finding entities → Named Entity Recognition
2. Finding relation instances → Relation Extraction
3. Finding URIs → URI Disambiguation
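The three sub-problems chain into one pipeline. The sketch below is only an illustration of that chaining: the three stage functions are hypothetical stubs that handle a single sentence shape, and the mapping of the `:` and `dbo:` prefixes to DBpedia namespaces is an assumption, since the slides do not spell the prefixes out.

```python
import re

# Assumed namespaces for the ":" and "dbo:" prefixes used on the slides.
DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"

# Hypothetical mapping from extracted relation names to ontology properties.
PREDICATE_URIS = {"bornIn": DBO + "birthPlace"}


def recognize_entities(text):
    """Step 1 (stub NER): handles only '<PER> was born in <LOC>.'"""
    m = re.match(r"(.+) was born in (.+)\.", text)
    return [(m.group(1), "PER"), (m.group(2), "LOC")] if m else []


def extract_relation(text, entities):
    """Step 2 (stub RE): emits bornIn for the one pattern we know."""
    if len(entities) == 2 and " was born in " in text:
        return ("bornIn", entities[0][0], entities[1][0])
    return None


def to_uri(label):
    """Step 3 (stub disambiguation): naive label-to-URI rewrite."""
    return DBR + label.replace(" ", "_")


def knowledge_extraction(text):
    """Run the three steps and serialize the result as one N-Triples line."""
    entities = recognize_entities(text)
    relation = extract_relation(text, entities)
    if relation is None:
        return None
    predicate, subject, obj = relation
    return f"<{to_uri(subject)}> <{PREDICATE_URIS[predicate]}> <{to_uri(obj)}> ."


print(knowledge_extraction("John Petrucci was born in New York."))
```

Each stub stands in for one of the sections that follow; real systems replace all three with the techniques discussed there.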
![Page 8: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/8.jpg)
Overview
1. Problem Definition
2. Named Entity Recognition
  • Algorithms
  • Ensemble Learning
3. Relation Extraction
  • General approaches
  • OpenIE approaches
4. Entity Disambiguation
  • URI Lookup
  • Disambiguation
5. Conclusion
![Page 9: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/9.jpg)
Named Entity Recognition
• Problem definition: Given a set of classes, find all strings that are labels of instances of these classes within a text fragment
John Petrucci was born in New York.
[John Petrucci, PER] was born in [New York, LOC].
![Page 10: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/10.jpg)
Named Entity Recognition
• Problem definition: Given a set of classes, find all strings that are labels of instances of these classes within a text fragment
• Common sets of classes
  • CoNLL03: Person, Location, Organization, Miscellaneous
  • ACE05: Facility, Geo-Political Entity, Location, Organization, Person, Vehicle, Weapon
  • BioNLP2004: Protein, DNA, RNA, cell line, cell type
• Several approaches
  • Direct solutions (single algorithms)
  • Ensemble Learning
![Page 11: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/11.jpg)
NER: Overview of approaches
• Dictionary-based
• Hand-crafted Rules
• Machine Learning
  • Hidden Markov Models (HMMs)
  • Conditional Random Fields (CRFs)
  • Neural Networks
  • k Nearest Neighbors (kNN)
  • Graph Clustering
• Ensemble Learning
  • Veto-Based (Bagging, Boosting)
  • Neural Networks
![Page 12: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/12.jpg)
NER: Dictionary-based
• Simple idea
  1. Define mappings between words and classes, e.g., Paris → Location
  2. Try to match each token from each sentence
  3. Return the matched entities
✓ Time-efficient at runtime
× Manual creation of gazetteers
× Low precision (Paris = Person, Location)
× Low recall (esp. on Persons and Organizations, as the number of instances grows)
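A minimal sketch of the dictionary lookup, assuming a tiny hand-written gazetteer. The entries and the function name `dictionary_ner` are illustrative and not taken from any tool mentioned in this talk.

```python
# Toy gazetteer (illustrative entries only).
GAZETTEER = {
    "John Petrucci": "PER",
    "New York": "LOC",
    "Paris": "LOC",  # in practice ambiguous: could also be a person
}


def dictionary_ner(text, gazetteer=GAZETTEER):
    """Return (surface form, class, start offset) for every gazetteer hit."""
    hits = []
    for label, ne_class in gazetteer.items():
        start = text.find(label)
        while start != -1:
            hits.append((label, ne_class, start))
            start = text.find(label, start + len(label))
    # Report hits in reading order.
    return sorted(hits, key=lambda hit: hit[2])


print(dictionary_ner("John Petrucci was born in New York."))
```

The lookup is fast, but the single class stored for "Paris" shows where the precision problem comes from: the dictionary has no way to use context.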
![Page 13: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/13.jpg)
NER: Rule-based
• Simple idea
  1. Define a set of rules to find entities, e.g., [PERSON] was born in [LOCATION].
  2. Try to match each sentence to one or several rules
  3. Return the matched entities
✓ High precision
× Manual creation of rules is very tedious
× Low recall (finite number of patterns)
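The example rule can be approximated with a regular expression. This is a sketch under a strong assumption: a name is any run of capitalized words, which is far cruder than what a real rule engine would use.

```python
import re

# Assumption: a "name" is one or more capitalized words separated by spaces.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Encodes the rule "[PERSON] was born in [LOCATION]".
BORN_IN = re.compile(rf"(?P<person>{NAME}) was born in (?P<location>{NAME})")


def rule_based_ner(sentence):
    """Return the entities found by the single 'was born in' rule."""
    match = BORN_IN.search(sentence)
    if match is None:
        return []
    return [(match.group("person"), "PER"), (match.group("location"), "LOC")]


print(rule_based_ner("John Petrucci was born in New York."))
```

A sentence such as "New York is the birthplace of John Petrucci" yields nothing, which is the low-recall problem in miniature: every new phrasing needs a new rule.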
![Page 14: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/14.jpg)
NER: Markov Models
• Stochastic process such that (Markov property)
  $P(X_{t+1} = S_j \mid X_1, \dots, X_t) = P(X_{t+1} = S_j \mid X_t)$
• Equivalent to a finite-state machine
• Formally consists of
  • Set $S$ of states $S_1, \dots, S_n$
  • Matrix $M$ such that $m_{ij} = P(X_{t+1} = S_j \mid X_t = S_i)$
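As a small worked example of the transition matrix, the sketch below computes the probability of a state sequence by multiplying the entries m_ij along the path. The three states and all probabilities are made up for illustration.

```python
# Hypothetical states and transition matrix M, with M[i][j] = P(X_{t+1}=S_j | X_t=S_i).
STATES = ["PER", "LOC", "O"]
M = [
    [0.6, 0.1, 0.3],  # from PER
    [0.1, 0.5, 0.4],  # from LOC
    [0.2, 0.2, 0.6],  # from O
]


def sequence_probability(sequence):
    """P(X_2, ..., X_n | X_1) under the Markov property."""
    prob = 1.0
    for prev, nxt in zip(sequence, sequence[1:]):
        prob *= M[STATES.index(prev)][STATES.index(nxt)]
    return prob


# PER -> PER -> O uses m(PER,PER) * m(PER,O) = 0.6 * 0.3
print(sequence_probability(["PER", "PER", "O"]))
```

Because of the Markov property, only consecutive pairs of states enter the product; no longer history is needed.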
![Page 15: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/15.jpg)
NER: Hidden Markov Models
• Extension of Markov Models
  • States are hidden and assigned an output function
  • Only the output is seen
  • Transitions are learned from training data
• How do they work?
  • Input: Discrete sequence of features (e.g., POS tags, word stems, etc.)
  • Goal: Find the best sequence of states that represents the input
  • Output: hopefully the right classification of each token
[Figure: state diagram with states S0, S1, …, Sn and the labels PER, _, LOC]
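Finding the best state sequence is commonly done with the Viterbi algorithm. The slides do not name a decoder, so treat this as one standard option rather than the method of any particular tagger. The features ("name", "place", "other") and all probabilities are invented for the example.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at step t, predecessor)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical toy model: every number below is made up.
STATES = ["PER", "LOC", "O"]
START = {s: 1 / 3 for s in STATES}
TRANS = {s: {t: (0.6 if s == t else 0.2) for t in STATES} for s in STATES}
EMIT = {
    "PER": {"name": 0.8, "place": 0.1, "other": 0.1},
    "LOC": {"name": 0.1, "place": 0.8, "other": 0.1},
    "O": {"name": 0.05, "place": 0.05, "other": 0.9},
}

# One feature per token of "John Petrucci was born in New York".
FEATURES = ["name", "name", "other", "other", "other", "place", "place"]
print(viterbi(FEATURES, STATES, START, TRANS, EMIT))
```

In a trained tagger the three tables would be estimated from labelled data instead of being written by hand.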
![Page 16: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/16.jpg)
NER: k Nearest Neighbors
• Idea
  • Describe each token q from a labelled training data set with a set of features (e.g., left and right neighbors)
  • Each new token t is described with the same features
  • Assign t the majority class among its k nearest neighbors
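A minimal kNN sketch, assuming each token is described only by its left and right neighbor and that distance is the number of differing features. The training examples are invented for illustration.

```python
from collections import Counter

# Toy labelled tokens: ((left neighbor, right neighbor), class). Made-up data.
TRAINING = [
    (("in", "."), "LOC"),
    (("in", ","), "LOC"),
    (("to", "."), "LOC"),
    (("<s>", "was"), "PER"),
    (("<s>", "said"), "PER"),
    (("and", "was"), "PER"),
]


def distance(a, b):
    """Number of feature positions on which two tokens differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_classify(features, k=3, training=TRAINING):
    """Assign the majority class among the k nearest training tokens."""
    nearest = sorted(training, key=lambda example: distance(features, example[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


print(knn_classify(("in", ".")))      # token preceded by "in", followed by "."
print(knn_classify(("<s>", "was")))   # sentence-initial token followed by "was"
```

Real feature sets are much richer (capitalization, POS tags, affixes), but the classification step stays the same.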
![Page 17: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/17.jpg)
NER: So far …
• "Simple approaches"
  • Apply one algorithm to the NER problem
  • Bound to be limited by the assumptions of the model
• Implemented by a large number of tools
  • Alchemy
  • Stanford NER
  • Illinois Tagger
  • Ontos NER Tagger
  • LingPipe
  • …
![Page 18: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/18.jpg)
NER: Ensemble Learning
• Intuition: Each algorithm has its strengths and weaknesses
• Idea: Use ensemble learning to merge the results of different algorithms so as to create a meta-classifier of higher accuracy
[Figure labels: Dictionary-based approaches, Pattern-based approaches, Conditional Random Fields, Support Vector Machines]
![Page 19: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/19.jpg)
NER: Ensemble Learning
• Idea: Merge the results of several approaches to improve results
• Simplest approaches:
  • Voting
  • Weighted voting
[Figure: Input → System 1, System 2, …, System n → Merger → Output]
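The merger for plain and weighted voting fits in a few lines. This sketch assumes every system labels the same token sequence; the system outputs and the weights are invented.

```python
from collections import defaultdict


def weighted_vote(predictions, weights=None):
    """Merge per-token labels from several systems.

    predictions: one label sequence per system, all of equal length.
    weights: one weight per system; plain voting if omitted.
    """
    weights = weights or [1.0] * len(predictions)
    merged = []
    for labels in zip(*predictions):
        scores = defaultdict(float)
        for label, weight in zip(labels, weights):
            scores[label] += weight
        merged.append(max(scores, key=scores.get))
    return merged


# Hypothetical outputs for "John Petrucci was born in New York".
SYSTEM_1 = ["PER", "PER", "O", "O", "O", "LOC", "LOC"]
SYSTEM_2 = ["PER", "O", "O", "O", "O", "LOC", "LOC"]
SYSTEM_3 = ["PER", "PER", "O", "O", "O", "PER", "LOC"]

print(weighted_vote([SYSTEM_1, SYSTEM_2, SYSTEM_3]))
print(weighted_vote([SYSTEM_1, SYSTEM_2, SYSTEM_3], weights=[0.2, 0.2, 0.9]))
```

With equal weights the two isolated mistakes are outvoted. With a heavy weight on the third system its mistake on "New" survives, which shows why the weights have to reflect how reliable each system actually is.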
![Page 20: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/20.jpg)
NER: Ensemble Learning
• When does it work?
• Accuracy
  • Existing solutions need to be "good"
  • Merging random results leads to random results
  • Given: current approaches reach 80% F-score
• Diversity
  • Correlation between approaches needs to be as small as possible
  • E.g., merging two HMM-based taggers won't help
  • Given: large number of approaches for NER
![Page 21: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/21.jpg)
NER: FOX
• Federated Knowledge Extraction Framework
• Idea: Apply ensemble learning to NER
• Classical approach: Voting
  • Does not make use of systematic error
  • Partly difficult to train
• Use neural networks instead
  • Can make use of systematic error
  • Easy to train
  • Converge fast
• http://fox.aksw.org
![Page 22: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/22.jpg)
NER: FOX
![Page 23: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/23.jpg)
NER: FOX on MUC7
![Page 24: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/24.jpg)
NER: FOX on MUC7
![Page 25: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/25.jpg)
NER: FOX on Website Data
![Page 26: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/26.jpg)
NER: FOX on Website Data
![Page 27: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/27.jpg)
NER: FOX on Companies and Countries
✓ No runtime issues (parallel implementation)
✓ NN overhead is small
× Overfitting
![Page 28: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/28.jpg)
NER: Summary
• Large number of approaches
  • Dictionaries
  • Hand-crafted rules
  • Machine Learning
  • Hybrid
  • …
Combining approaches leads to better results than single algorithms
![Page 29: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/29.jpg)
Overview
1. Problem Definition
2. Named Entity Recognition
  • Algorithms
  • Ensemble Learning
3. Relation Extraction
  • General approaches
  • OpenIE approaches
4. Entity Disambiguation
  • URI Lookup
  • Disambiguation
5. Conclusion
![Page 30: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/30.jpg)
RE: Problem Definition
• Find the relations between NEs if such relations exist.
• NEs are not always given a priori (open vs. closed RE)
John Petrucci was born in New York.
[John Petrucci, PER] was born in [New York, LOC].
bornIn([John Petrucci, PER], [New York, LOC]).
![Page 31: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/31.jpg)
RE: Approaches
• Hand-crafted rules
• Pattern Learning
• Coupled Learning
![Page 32: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/32.jpg)
RE: Pattern-based
• Hearst patterns [Hearst: COLING'92]
  • POS-enhanced regular expression matching in natural-language text
    NP0 {,} such as {NP1, NP2, … (and|or)}{,} NPn
    NP0 {,}{NP1, NP2, … NPn-1}{,} or other NPn
"The bow lute, such as the Bambara ndang, is plucked and has an individual curved neck for each string."
isA("Bambara ndang", "bow lute")
✓ Time-efficient at runtime
× Very low recall
× Not adaptable to other relations
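A rough sketch of the first pattern, restricted to a single hyponym. It assumes no POS tagger is available and approximates a noun phrase as an optional determiner plus one or two words, so it is much weaker than a genuine POS-enhanced matcher.

```python
import re

# Crude stand-in for "NP0 {,} such as NP1" without POS tags (an assumption):
# a noun phrase is an optional determiner followed by one or two words.
SUCH_AS = re.compile(
    r"(?:[Tt]he |[Aa]n? )?(?P<hypernym>\w+(?: \w+)?),? such as "
    r"(?:the |an? )?(?P<hyponym>[A-Z]\w*(?: \w+)?)"
)


def hearst_is_a(sentence):
    """Return (hyponym, hypernym) if the 'such as' pattern matches, else None."""
    match = SUCH_AS.search(sentence)
    if match is None:
        return None
    return (match.group("hyponym"), match.group("hypernym"))


SENTENCE = ("The bow lute, such as the Bambara ndang, is plucked and has "
            "an individual curved neck for each string.")
print(hearst_is_a(SENTENCE))
```

The word-count heuristic will misjudge phrase boundaries on many sentences, which is exactly why the original patterns rely on POS information.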
![Page 33: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/33.jpg)
RE: DIPRE
• DIPRE = Dual Iterative Pattern Relation Expansion
• Semi-supervised, iterative gathering of facts and patterns
• Positive & negative examples as seeds for a given target relation
  • e.g. +(Hillary, Bill); +(Carla, Nicolas); –(Larry, Google)
• Various tuning parameters for pruning low-confidence patterns and facts
• Extended to SnowBall / QXtract
Seed facts: (Hillary, Bill), (Carla, Nicolas)
Patterns:
  X and her husband Y
  X and Y on their honeymoon
  X and Y and their children
  X has been dating with Y
  X loves Y
Facts: (Angelina, Brad), (Hillary, Bill), (Victoria, David), (Carla, Nicolas), (Larry, Google), …
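The fact/pattern loop can be sketched as follows. This is a simplified illustration, not the original system: a pattern is just the text between X and Y, there are no negative seeds, and nothing is pruned. The corpus sentences are invented.

```python
import re

# Made-up mini corpus.
CORPUS = [
    "Hillary and her husband Bill attended the dinner.",
    "Carla and her husband Nicolas visited Berlin.",
    "Victoria and her husband David moved to London.",
    "Angelina loves Brad.",
    "Carla loves Nicolas.",
]


def learn_patterns(facts, corpus):
    """Collect the text between X and Y wherever a known pair co-occurs."""
    patterns = set()
    for x, y in facts:
        for sentence in corpus:
            match = re.search(re.escape(x) + r"(.+?)" + re.escape(y), sentence)
            if match:
                patterns.add(match.group(1))
    return patterns


def apply_patterns(patterns, corpus):
    """Find new (X, Y) pairs connected by one of the learned patterns."""
    facts = set()
    for middle in patterns:
        regex = re.compile(r"(\w+)" + re.escape(middle) + r"(\w+)")
        for sentence in corpus:
            facts.update(regex.findall(sentence))
    return facts


def bootstrap(seeds, corpus, iterations=2):
    """Alternate between learning patterns and harvesting facts."""
    facts = set(seeds)
    for _ in range(iterations):
        facts |= apply_patterns(learn_patterns(facts, corpus), corpus)
    return facts


print(sorted(bootstrap({("Hillary", "Bill"), ("Carla", "Nicolas")}, CORPUS)))
```

Without pruning, one bad pattern would pull in unrelated pairs on the next iteration, which is what the tuning parameters and negative seeds are meant to prevent.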
![Page 34: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/34.jpg)
RE: NELL
• Never-Ending Language Learner (http://rtw.ml.cmu.edu/)
• Open IE with ontological backbone
  • Closed set of categories & typed relations
  • Seeds/counter seeds (5-10)
  • Open set of predicate arguments (instances)
• Coupled iterative learners
• Constantly running over a large Web corpus since January 2010 (200 million pages)
• Periodic human supervision
athletePlaysForTeam(Athlete, SportsTeam)
athletePlaysForTeam(Alex Rodriguez, Yankees)
athletePlaysForTeam(Alexander_Ovechkin, Penguins)
![Page 35: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/35.jpg)
RE: NELL
Conservative strategy → Avoid semantic drift
![Page 36: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/36.jpg)
RE: BOA
• Bootstrapping Linked Data (http://boa.aksw.org)
• Core idea: Use instance data in the Data Web to discover NL patterns and new instances
![Page 37: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/37.jpg)
RE: BOA
• Follows conservative strategy
  • Only top pattern
  • Frequency threshold
  • Score threshold
• Evaluation results
![Page 38: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/38.jpg)
RE: Summary
• Several approaches
  • Hand-crafted rules
  • Machine Learning
  • Hybrid
✓ Large number of instances available for many relations
✓ Runtime problem → Parallel implementations
✓ Many new facts can be found
× Semantic drift
× Long tail
× Entity Disambiguation
![Page 39: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/39.jpg)
Overview
1. Problem Definition
2. Named Entity Recognition
  • Algorithms
  • Ensemble Learning
3. Relation Extraction
  • General approaches
  • OpenIE approaches
4. Entity Disambiguation
  • URI Lookup
  • Disambiguation
5. Conclusion
![Page 40: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/40.jpg)
ED: Problem Definition
• Given (a) reference knowledge base(s), a text fragment, a list of NEs (incl. position), and a list of relations, find URIs for each of the NEs and relations
• Very difficult problem
  • Ambiguity, e.g., Paris = Paris Hilton? Paris (France)?
  • Difficult even for humans, e.g., "Paris' mayor died yesterday"
• Several solutions
  • Indexing
  • Surface Form
  • Graph-based
![Page 41: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/41.jpg)
ED: Problem Definition
John Petrucci was born in New York.
[John Petrucci, PER] was born in [New York, LOC].
bornIn([John Petrucci, PER], [New York, LOC]).
:John_Petrucci dbo:birthPlace :New_York .
![Page 42: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/42.jpg)
ED: Indexing
• More retrieval than disambiguation
• Similar to dictionary-based approaches
• Idea
  • Index all labels in the reference knowledge base
  • Given an input label, retrieve all entities with a similar label
× Poor recall (unknown surface forms, e.g., "Mme Curie" for "Marie Curie")
× Low precision (Paris = Paris Hilton, Paris (France), …)
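A minimal label index, assuming that "similar" is reduced to a case-insensitive exact match. The URIs and labels form a toy knowledge base written for this example, not an excerpt from DBpedia.

```python
from collections import defaultdict

# Toy knowledge base: URI -> known labels (illustrative only).
KB_LABELS = {
    ":Paris": ["Paris"],
    ":Paris,_Ontario": ["Paris"],
    ":Paris_Hilton": ["Paris Hilton", "Paris"],
    ":Marie_Curie": ["Marie Curie"],
}


def build_index(kb_labels):
    """Map each lower-cased label to the set of URIs carrying it."""
    index = defaultdict(set)
    for uri, labels in kb_labels.items():
        for label in labels:
            index[label.lower()].add(uri)
    return index


INDEX = build_index(KB_LABELS)


def lookup(label, index=INDEX):
    """Return all candidate URIs for a label, sorted for stable output."""
    return sorted(index.get(label.lower(), set()))


print(lookup("Paris"))      # several candidates: low precision
print(lookup("Mme Curie"))  # unknown surface form: nothing found
```

Both weaknesses from the slide show up directly: "Paris" returns every entity sharing the label, and "Mme Curie" returns nothing because that surface form was never indexed.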
![Page 43: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/43.jpg)
ED: Type Disambiguation
• Extension of indexing
  • Index all labels
  • Infer type information
  • Retrieve labels from entities of the given type
• Same recall as previous approach
• Higher precision
  • Paris[LOC] != Paris[PER]
  • Still, Paris (France) vs. Paris (Ontario)
• Need for context
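Adding the NER type as a filter on the lookup is a one-line change. The sketch below uses an invented three-entry knowledge base with hypothetical URIs and type tags.

```python
# Toy knowledge base: (URI, label, type). Illustrative entries only.
KB = [
    (":Paris", "Paris", "LOC"),
    (":Paris,_Ontario", "Paris", "LOC"),
    (":Paris_Hilton", "Paris", "PER"),
]


def lookup_typed(label, ne_type, kb=KB):
    """Return URIs whose label matches and whose type equals the NER type."""
    return [
        uri
        for uri, kb_label, kb_type in kb
        if kb_label.lower() == label.lower() and kb_type == ne_type
    ]


print(lookup_typed("Paris", "PER"))  # the person reading is now unambiguous
print(lookup_typed("Paris", "LOC"))  # two locations remain: context is needed
```

The type filter removes cross-type confusion but cannot separate two candidates of the same type, which motivates the context-based approaches that follow.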
![Page 44: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/44.jpg)
ED: Spotlight
• Known surface forms (http://dbpedia.org/spotlight)
  • Based on DBpedia + Wikipedia
  • Uses supplementary knowledge including disambiguation pages, redirects, wikilinks
• Three main steps
  • Spotting: Finding possible mentions of DBpedia resources, e.g.,
    John Petrucci was born in New York.
  • Candidate Selection: Find possible URIs, e.g.,
    John Petrucci → :John_Petrucci
    New York → :New_York, :New_York_County, …
  • Disambiguation: Map context to vector for each resource
    New York → :New_York
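The disambiguation step can be illustrated with plain bag-of-words vectors and cosine similarity. This is a generic vector-space sketch, not Spotlight's actual weighting scheme, and the context words attached to each candidate are invented.

```python
import math
from collections import Counter

# Hypothetical context words for each candidate resource.
CANDIDATE_CONTEXTS = {
    ":New_York": "state united states born city albany population",
    ":New_York_County": "county manhattan borough administrative",
}


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[term] * b[term] for term in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def disambiguate(mention_context, candidates=CANDIDATE_CONTEXTS):
    """Pick the candidate whose context vector is closest to the mention's."""
    query = Counter(mention_context.lower().split())
    return max(
        candidates,
        key=lambda uri: cosine(query, Counter(candidates[uri].split())),
    )


print(disambiguate("John Petrucci was born in New York"))
```

Here the single shared word "born" is enough to prefer one candidate; real context vectors are built from all text surrounding the resource's mentions in Wikipedia.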
![Page 45: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/45.jpg)
ED: YAGO2
• Joint Disambiguation
"Mississippi, one of Bob's later songs, was first recorded by Sheryl on her album."
![Page 46: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/46.jpg)
ED: YAGO2
Mentions of entities: Mississippi, Bob, Sheryl
Entity candidates: Mississippi (State), Mississippi (Song), Bob Dylan Songs, Sheryl Cruz, Sheryl Lee, Sheryl Crow
Edge weights between mentions and candidates: prior(ml, ei), sim(cxt(ml), cxt(ei))
Edge weights between candidates: coh(ei, ej)
Objective: Maximize objective function (e.g., total weight)
Constraint: Keep at least one entity per mention
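A small exhaustive sketch of the joint objective, assuming the prior and similarity are folded into one mention-entity weight and that exactly one entity is chosen per mention. All weights are invented, and practical systems would use an approximation rather than trying every combination.

```python
from itertools import combinations, product

# Candidates per mention, named as on the slide.
CANDIDATES = {
    "Mississippi": ["Mississippi (State)", "Mississippi (Song)"],
    "Bob": ["Bob Dylan Songs"],
    "Sheryl": ["Sheryl Cruz", "Sheryl Lee", "Sheryl Crow"],
}

# Hypothetical mention-entity weights (prior and similarity combined).
LOCAL = {
    ("Mississippi", "Mississippi (State)"): 0.8,
    ("Mississippi", "Mississippi (Song)"): 0.2,
    ("Bob", "Bob Dylan Songs"): 0.9,
    ("Sheryl", "Sheryl Cruz"): 0.3,
    ("Sheryl", "Sheryl Lee"): 0.3,
    ("Sheryl", "Sheryl Crow"): 0.4,
}

# Hypothetical entity-entity coherence weights; missing pairs count as 0.
COHERENCE = {
    frozenset({"Mississippi (Song)", "Bob Dylan Songs"}): 0.9,
    frozenset({"Mississippi (Song)", "Sheryl Crow"}): 0.7,
    frozenset({"Bob Dylan Songs", "Sheryl Crow"}): 0.6,
}


def score(assignment):
    """Total weight: mention-entity edges plus entity-entity coherence."""
    local = sum(LOCAL[(mention, entity)] for mention, entity in assignment.items())
    coherence = sum(
        COHERENCE.get(frozenset(pair), 0.0)
        for pair in combinations(assignment.values(), 2)
    )
    return local + coherence


def joint_disambiguation(candidates=CANDIDATES):
    """Try every combination and keep the one with the highest total weight."""
    mentions = list(candidates)
    best = max(
        product(*(candidates[m] for m in mentions)),
        key=lambda combo: score(dict(zip(mentions, combo))),
    )
    return dict(zip(mentions, best))


print(joint_disambiguation())
```

Taken alone, "Mississippi" would resolve to the state because of its higher prior. The coherence edges to the other two entities are what tip the joint decision toward the song.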
![Page 47: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/47.jpg)
ED: FOX
• Generic Approach
  • A-priori score (a): Popularity of URIs
  • Similarity score (s): Similarity of resource labels and text
  • Coherence score (z): Correlation between URIs
[Figure: graph annotated with the scores a, s, and z]
![Page 48: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/48.jpg)
ED: FOX
• Allows the use of several algorithms
  • HITS
  • PageRank
  • Apriori
  • Propagation algorithms
  • …
![Page 49: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/49.jpg)
ED: Summary
• Difficult problem even for humans
• Several approaches
  • Simple search
  • Search with restrictions
  • Known surface forms
  • Graph-based
✓ Improved F-score for DBpedia (70-80%)
× Low F-score for generic knowledge bases
× Intrinsically difficult
× Still a lot to do
![Page 50: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/50.jpg)
Overview
1. Problem Definition
2. Named Entity Recognition
  • Algorithms
  • Ensemble Learning
3. Relation Extraction
  • General approaches
  • OpenIE approaches
4. Entity Disambiguation
  • URI Lookup
  • Disambiguation
5. Conclusion
![Page 51: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/51.jpg)
Conclusion
• Discussed basics of …
  • Knowledge Extraction problem
  • Named Entity Recognition
  • Relation Extraction
  • Entity Disambiguation
• Still a lot of research necessary
  • Ensemble and active learning
  • Entity Disambiguation
  • Question Answering …
![Page 52: F ROM U NSTRUCTURED I NFORMATION T O L INKED D ATA Axel Ngonga Head of SIMBA@AKSW University of Leipzig IASLOD, August 15/16 th 2012](https://reader036.vdocuments.us/reader036/viewer/2022062423/56649e635503460f94b5facd/html5/thumbnails/52.jpg)
Thank You!
Questions?