Ontology-Driven Automatic Entity Disambiguation in Unstructured Text
Jed Hassell
Introduction
► No explicit semantic information about data and objects is presented in most Web pages.
► The Semantic Web aims to solve this problem by providing an underlying mechanism to add semantic metadata to content:
  Ex: The entity “UGA” pointing to http://www.uga.edu
  Using entity disambiguation
Introduction
► We use background knowledge in the form of an ontology
► Our contributions are two-fold:
  A novel method to disambiguate entities within unstructured text by using clues in the text and exploiting metadata from the ontology
  An implementation of our method that uses a very large, real-world ontology to demonstrate effective entity disambiguation in the domain of Computer Science researchers
Background
► Sesame Repository
  Open source RDF repository
  We chose Sesame, as opposed to Jena and BRAHMS, because of its ability to store large amounts of information by not being dependent on memory storage alone
  We chose to use Sesame’s native mode because our dataset is typically too large to fit into memory and using the database option is too slow in update operations
Dataset 1: DBLP Ontology
► DBLP is a website that contains bibliographic information for computer scientists, journals and proceedings:
  3,079,414 entities (447,121 are authors)
  We used a SAX parser to parse the DBLP XML file that is available online
  Created relationships such as “co-author”
  Added information regarding affiliations
  Added information regarding areas of interest
  Added alternate spellings for international characters
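The SAX-based extraction of co-author relationships described above can be sketched as follows. This is a minimal illustration, assuming DBLP-style records (`article`, `inproceedings`) with `author` child elements; the sample data, the `CoAuthorHandler` class, and the choice of record types are assumptions for illustration, not the implementation used in this work.

```python
import xml.sax
from collections import defaultdict
from itertools import combinations

# Tiny stand-in for the DBLP XML dump (hypothetical data).
SAMPLE = b"""<dblp>
  <article key="journals/x/1">
    <author>Alice Smith</author>
    <author>Bob Jones</author>
    <title>Paper One</title>
  </article>
  <inproceedings key="conf/y/2">
    <author>Alice Smith</author>
    <author>Carol Wu</author>
    <title>Paper Two</title>
  </inproceedings>
</dblp>"""

# Assumption: only these two record types are handled in this sketch.
RECORD_TAGS = {"article", "inproceedings"}


class CoAuthorHandler(xml.sax.ContentHandler):
    """Streams through records and links every pair of authors on a record."""

    def __init__(self):
        super().__init__()
        self.coauthors = defaultdict(set)
        self._authors = []
        self._in_author = False
        self._buf = []

    def startElement(self, name, attrs):
        if name in RECORD_TAGS:
            self._authors = []
        elif name == "author":
            self._in_author = True
            self._buf = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so buffer them.
        if self._in_author:
            self._buf.append(content)

    def endElement(self, name):
        if name == "author":
            self._authors.append("".join(self._buf).strip())
            self._in_author = False
        elif name in RECORD_TAGS:
            for a, b in combinations(self._authors, 2):
                self.coauthors[a].add(b)
                self.coauthors[b].add(a)


handler = CoAuthorHandler()
xml.sax.parseString(SAMPLE, handler)
```

Because SAX streams the file instead of building a tree, memory use stays roughly constant, which matters for a dump with millions of entities.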
Dataset 2: DBWorld Posts
► DBWorld
  Mailing list of information for upcoming conferences related to the databases field
  Created an HTML scraper that downloads everything with “Call for Papers”, “Call for Participation” or “CFP” in its subject
  Unstructured text
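The subject filter used by the scraper can be illustrated with a short sketch. The three keywords come from the slide; the function name `is_call_post` and the case-insensitive matching are assumptions.

```python
import re

# Keywords from the slide; case-insensitive matching is an assumption.
CFP_PATTERN = re.compile(
    r"call for papers|call for participation|\bcfp\b", re.IGNORECASE
)


def is_call_post(subject: str) -> bool:
    """Return True if a post subject looks like a call for papers/participation."""
    return CFP_PATTERN.search(subject) is not None
```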
Overview of System Architecture
Approach
► Entity Names
  Entity attribute that represents the name of the entity
  Can contain more than one name
Approach
► Text-proximity Relationships
  Relationships that can be expected to be in text-proximity of the entity
  Nearness measured in character spaces
Approach
► Text Co-occurrence Relationships
  Similar to text-proximity relationships except proximity is not relevant
Approach
► Popular Entities
  The intuition behind this is to specify relationships that will bias the right entity to be the most popular entity
  This should be used with care, depending on the domain
  DBLP ex: the number of papers the entity has authored
Approach
► Semantic Relationships
  Entities can be related to one another through their collaboration network
  DBLP ex: Entities are related to one another through co-author relationships
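The five kinds of ontology metadata introduced in the Approach slides could be declared as a small configuration object. The sketch below is one possible shape, assuming hypothetical property names (`name`, `alternate_name`, `affiliation`, and so on); only the roles and the DBLP examples come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationshipRoles:
    """Which ontology properties play which role during disambiguation."""
    entity_names: tuple
    text_proximity: tuple
    text_cooccurrence: tuple
    popularity: tuple
    semantic: tuple


# Property names are hypothetical; the roles follow the DBLP examples in the slides.
DBLP_ROLES = RelationshipRoles(
    entity_names=("name", "alternate_name"),   # an entity can have several names
    text_proximity=("affiliation",),           # expected near the name in text
    text_cooccurrence=("area_of_interest",),   # expected anywhere in the text
    popularity=("authored_paper",),            # counted to bias popular entities
    semantic=("co_author",),                   # links between candidate entities
)
```

Keeping the roles in data rather than in code would let the same algorithm be pointed at a different ontology by swapping this object.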
Algorithm
► Idea is to spot entity names in text and assign each potential match a confidence score
► This confidence score will be adjusted as the algorithm progresses and represents the certainty that this spotted entity represents a particular object in the ontology
Algorithm – Flow Chart
Start
  Spot entity names. Found?
    no: Do nothing
    yes: Initiate confidence score and store in Candidate Entities
  More entities? yes: repeat; no: continue
  Spot text-proximity relationships. Found?
    yes: Adjust confidence score
    no: Do nothing
  More candidate entities? yes: repeat; no: continue
Algorithm – Flow Chart
  Spot text co-occurrence relationships. Found?
    yes: Adjust confidence score
    no: Do nothing
  More candidate entities? yes: repeat; no: continue
  Adjust confidence score based on number of popular entity relationships
  Search for semantic relationships. Found?
    yes: Adjust confidence score
    no: No change
  More candidate entities? yes: repeat; no: continue
  Candidate entity rise above threshold?
    yes (Iterative Step): search for semantic relationships again
    no: End
Algorithm
► Spotting Entity Names
  Search document for entity names within the ontology
  Each entity in the ontology that matches a name found in the document becomes a candidate entity
  Assign initial confidence scores for candidate entities based on these formulas:
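The spotting step can be sketched as a scan of the document for every name known to the ontology. The slide's formulas are not reproduced in this transcript, so the initial score below (one unit of confidence split evenly among the entities that share a mention) is a placeholder assumption, as are the `Candidate` class and the sample URIs.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    uri: str
    name: str
    offset: int   # character offset of the mention in the document
    score: float


# Hypothetical ontology excerpt: entity URI -> known names.
ONTOLOGY_NAMES = {
    "ex:author/1": ["John Smith", "J. Smith"],
    "ex:author/2": ["John Smith"],
    "ex:author/3": ["Amy Lee"],
}


def spot_candidates(text, ontology_names):
    """Find every ontology name in the text and create candidate entities."""
    hits = []
    for uri, names in ontology_names.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                hits.append((uri, name, match.start()))

    # Placeholder scoring (an assumption, not the slide's formulas):
    # entities that match the same mention share one unit of confidence.
    per_mention = {}
    for uri, name, offset in hits:
        per_mention.setdefault((name, offset), []).append(uri)

    return [
        Candidate(uri, name, offset, 1.0 / len(uris))
        for (name, offset), uris in per_mention.items()
        for uri in uris
    ]


doc = "Program chairs: John Smith (UGA) and Amy Lee."
candidates = spot_candidates(doc, ONTOLOGY_NAMES)
```

Here the ambiguous mention "John Smith" yields two candidates with equal scores, which the later steps are meant to separate.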
Algorithm
► Spotting Literal Values of Text-proximity Relationships
  Only consider relationships from candidate entities
  Substantially increase confidence score if within proximity
  Ex: Entity affiliation found next to entity name
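A minimal sketch of the text-proximity adjustment follows. The slides state only that nearness is measured in characters and that the increase is substantial; the window of 80 characters, the boost of 0.3, and the cap at 1.0 are assumed values chosen for illustration.

```python
def boost_for_proximity(text, offset, name, literals, score,
                        window=80, boost=0.3):
    """Raise `score` if any literal lies within `window` characters of the name.

    `window` and `boost` are illustrative values, not taken from the slides.
    """
    start = max(0, offset - window)
    end = min(len(text), offset + len(name) + window)
    neighbourhood = text[start:end]
    if any(literal in neighbourhood for literal in literals):
        return min(1.0, score + boost)
    return score


doc = "Program chairs: John Smith (University of Georgia) and Amy Lee."
offset = doc.index("John Smith")
near = boost_for_proximity(doc, offset, "John Smith",
                           ["University of Georgia"], 0.5)
far = boost_for_proximity(doc, offset, "John Smith",
                          ["Stanford University"], 0.5)
```

Only the candidate whose affiliation appears next to the mention is boosted; the other keeps its score.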
Algorithm
► Spotting Literal Values of Text Co-occurrence Relationships
  Only consider relationships from candidate entities
  Increase confidence score if found within the document (location does not matter)
  Ex: Entity’s areas of interest found in the document
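The co-occurrence adjustment differs from the proximity one only in that the whole document is searched. In the sketch below, the per-literal boost of 0.1 and the case-insensitive comparison are assumptions.

```python
def boost_for_cooccurrence(text, literals, score, boost=0.1):
    """Raise `score` once for each literal found anywhere in the document.

    `boost` and case-insensitive matching are illustrative assumptions.
    """
    lowered = text.lower()
    found = sum(1 for literal in literals if literal.lower() in lowered)
    return min(1.0, score + boost * found)


doc = "Topics of interest include data mining and query optimization."
adjusted = boost_for_cooccurrence(doc, ["Data Mining", "Bioinformatics"], 0.5)
unchanged = boost_for_cooccurrence(doc, ["Computer Graphics"], 0.5)
```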
Algorithm
► Using Popular Entities
  Slightly increase the confidence score of candidate entities based on the number of popular entity relationships
  Valuable when used as a tie-breaker
  Ex: Candidate entities with more than 15 publications receive a slight increase in their confidence score
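The popularity adjustment from the DBLP example can be written as a one-rule function. The cut-off of more than 15 publications comes from the slide; the size of the increase (0.05) is an assumed value.

```python
def boost_for_popularity(score, publication_count,
                         min_publications=15, boost=0.05):
    """Slightly raise `score` for entities with more than `min_publications`.

    The threshold of 15 is from the slide; `boost` is an assumed value.
    """
    if publication_count > min_publications:
        return min(1.0, score + boost)
    return score


# Two candidates tied at 0.5: the prolific author edges ahead.
prolific = boost_for_popularity(0.5, publication_count=40)
occasional = boost_for_popularity(0.5, publication_count=3)
```

Keeping the increase small means it acts as a tie-breaker and cannot outweigh evidence found in the text.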
Algorithm
► Using Semantic Relationships
  Use relationships among entities to boost confidence scores of candidate entities
  Each candidate entity with a confidence score above the threshold is analyzed for semantic relationships to other candidate entities. If another candidate entity is found and is below the threshold, that entity’s confidence score is increased
Algorithm
► If any candidate entity rises above the threshold, the process repeats until the algorithm stabilizes
► This is an iterative step and always converges
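The iterative semantic step can be sketched as a loop that stops when no new candidate crosses the threshold. The threshold (0.9), the boost (0.2), and the rule that each confident entity propagates its boost only once are assumptions; under that rule the loop runs at most once per candidate, which is one way to see why the process terminates.

```python
def propagate_semantic(scores, related, threshold=0.9, boost=0.2):
    """Boost below-threshold candidates related to above-threshold ones.

    `scores` maps entity -> confidence; `related` maps entity -> set of
    related entities (e.g. co-authors). Threshold and boost are assumed values.
    """
    scores = dict(scores)   # work on a copy
    processed = set()       # confident entities that have already propagated
    while True:
        confident = [entity for entity, score in scores.items()
                     if score > threshold and entity not in processed]
        if not confident:
            return scores   # stable: nothing new rose above the threshold
        for entity in confident:
            processed.add(entity)
            for other in related.get(entity, ()):
                if other in scores and scores[other] < threshold:
                    scores[other] = min(1.0, scores[other] + boost)


initial = {"a": 0.95, "b": 0.75, "c": 0.55, "d": 0.5}
coauthors = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
result = propagate_semantic(initial, coauthors)
```

In this example `a` lifts `b` above the threshold, `b` then lifts `c` (though not above it), and `d`, which has no relationships, is left unchanged.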
Output
► XML format
  URI – the DBLP URL of the entity
  Entity name
  Confidence score
  Character offset – the location of the entity in the document
► This is a generic output and can easily be converted for use in Microformats, RDFa, etc.
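An XML record carrying the four fields listed above might be produced as in the sketch below. The element names (`entities`, `entity`, `uri`, `name`, `confidence`, `offset`) and the sample URI are hypothetical; the slides specify only which fields are present, not the schema.

```python
import xml.etree.ElementTree as ET


def to_xml(entities):
    """Serialize disambiguated entities; element names are hypothetical."""
    root = ET.Element("entities")
    for entity in entities:
        node = ET.SubElement(root, "entity")
        ET.SubElement(node, "uri").text = entity["uri"]
        ET.SubElement(node, "name").text = entity["name"]
        ET.SubElement(node, "confidence").text = f"{entity['confidence']:.2f}"
        ET.SubElement(node, "offset").text = str(entity["offset"])
    return ET.tostring(root, encoding="unicode")


xml_out = to_xml([
    {"uri": "http://example.org/author/1",  # placeholder, not a real DBLP URL
     "name": "John Smith", "confidence": 0.95, "offset": 16},
])
```

Because each record carries the character offset, a later pass can wrap the mention in the source text with Microformat or RDFa markup.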
Output - Microformat
Evaluation: Gold Standard Set
► We evaluate our system using a gold standard set of documents
  20 manually disambiguated documents
  Randomly chose 20 consecutive posts from DBWorld
  We use precision and recall as the evaluation measures for our system
Evaluation: Precision & Recall
► We define set A as the set of unique names identified using the disambiguated dataset
► We define set B as the set of entities found by our method
► The intersection of these sets represents the set of entities correctly identified by our method
Evaluation: Precision & Recall
► Precision is the proportion of correctly disambiguated entities with regard to B
► Recall is the proportion of correctly disambiguated entities with regard to A
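Written with the sets A (entities in the gold standard) and B (entities found by the method) defined earlier, the two measures are:

```latex
\[
\text{Precision} = \frac{|A \cap B|}{|B|},
\qquad
\text{Recall} = \frac{|A \cap B|}{|A|}
\]
```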
Evaluation: Results
► Precision and recall when compared to entire gold standard set:
  Correct Disambiguation   Found Entities   Total Entities   Precision   Recall
  602                      620              758              97.1%       79.4%
► Precision and recall on a per document basis:
[Chart: Precision and Recall. x-axis: Documents (1 to 20); y-axis: Percentage (0 to 100); series: Recall, Precision]
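The aggregate precision and recall on this slide follow directly from the three reported counts (602 correct, 620 found, 758 total in the gold standard). A short check, with variable names chosen here for illustration:

```python
correct = 602   # correctly disambiguated entities, |A ∩ B|
found = 620     # entities found by the method, |B|
total = 758     # entities in the gold standard, |A|

precision = correct / found
recall = correct / total

print(f"precision = {precision:.1%}, recall = {recall:.1%}")
# prints: precision = 97.1%, recall = 79.4%
```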
Related Work
► Semex:
  Personal information management system that works with a user’s desktop
  Takes advantage of a predictable structure
  The results of disambiguated entities are propagated to other ambiguous entities, which could then be reconciled based on recently reconciled entities, much like our work does
Related Work
► Kim:
  An application that aims at automatic ontology population
  Contains an entity recognition portion that uses natural language processors
  Evaluations performed on human annotated corpora
  Missed a lot of entities and results had many false positives
Conclusion
► Our method uses relationships between entities in the ontology to go beyond traditional syntactic-based disambiguation techniques
► This work is among the first to successfully use relationships for identifying entities in text without relying on the structure of the text
Thank you!