Ontology-Driven Automatic Entity Disambiguation in Unstructured Text

Jed Hassell

Introduction

► No explicit semantic information about data and objects is presented in most Web pages.
► The Semantic Web aims to solve this problem by providing an underlying mechanism to add semantic metadata to content:
    Ex: The entity “UGA” pointing to http://www.uga.edu
    Using entity disambiguation

Introduction

► We use background knowledge in the form of an ontology
► Our contributions are two-fold:
    A novel method to disambiguate entities within unstructured text by using clues in the text and exploiting metadata from the ontology
    An implementation of our method that uses a very large, real-world ontology to demonstrate effective entity disambiguation in the domain of Computer Science researchers

Background

► Sesame Repository
    Open source RDF repository
    We chose Sesame, as opposed to Jena and BRAHMS, because of its ability to store large amounts of information by not being dependent on memory storage alone
    We chose to use Sesame’s native mode because our dataset is typically too large to fit into memory and using the database option is too slow in update operations

Dataset 1: DBLP Ontology

► DBLP is a website that contains bibliographic information for computer scientists, journals and proceedings:
    3,079,414 entities (447,121 are authors)
    We used a SAX parser to parse the DBLP XML file that is available online
    Created relationships such as “co-author”
    Added information regarding affiliations
    Added information regarding areas of interest
    Added alternate spellings for international characters
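As an illustration of the SAX-based step, here is a minimal sketch of how “co-author” pairs might be derived while streaming over DBLP-style XML. It is not the implementation used in this work: the handler class, the choice of publication tags, and the sample records (with made-up author names) are assumptions for the example, and character entities declared in the real file's DTD are not handled.

```python
import itertools
import xml.sax

# Assumed subset of publication element names found in DBLP-style XML.
PUBLICATION_TAGS = {"article", "inproceedings", "book", "incollection"}


class CoAuthorHandler(xml.sax.ContentHandler):
    """Collects co-author pairs one publication at a time (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.coauthors = set()   # unordered pairs stored as sorted tuples
        self._authors = []       # authors of the publication being read
        self._in_author = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name in PUBLICATION_TAGS:
            self._authors = []
        elif name == "author":
            self._in_author = True
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so buffer them.
        if self._in_author:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "author":
            self._authors.append("".join(self._buffer).strip())
            self._in_author = False
        elif name in PUBLICATION_TAGS:
            unique = sorted(set(self._authors))
            for a, b in itertools.combinations(unique, 2):
                self.coauthors.add((a, b))


def extract_coauthors(xml_bytes):
    handler = CoAuthorHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.coauthors


# Made-up sample records, shaped like DBLP entries.
SAMPLE = b"""<dblp>
  <article key="a1"><author>Ann Lee</author><author>Bo Chen</author><title>T1</title></article>
  <inproceedings key="p1"><author>Bo Chen</author><title>T2</title></inproceedings>
</dblp>"""

pairs = extract_coauthors(SAMPLE)
```

A streaming parser keeps memory use flat regardless of file size, which matters for a dataset with millions of entities.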

Dataset 2: DBWorld Posts

► DBWorld
    Mailing list of information for upcoming conferences related to the databases field
    Created an HTML scraper that downloads everything with “Call for Papers”, “Call for Participation” or “CFP” in its subject
    Unstructured text
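The subject filter used by the scraper could look roughly like the sketch below. The function name and the case-insensitive matching are assumptions; the slides only state which phrases are searched for, and the sample subjects are made up.

```python
import re

# The three phrases named on the slide; case-insensitivity is an assumption.
CFP_PATTERN = re.compile(
    r"call for papers|call for participation|\bcfp\b", re.IGNORECASE
)


def is_cfp_subject(subject):
    """Return True if a post subject looks like a call for papers (hypothetical helper)."""
    return CFP_PATTERN.search(subject) is not None


keep = is_cfp_subject("CFP: Workshop on Semantic Data")
skip = is_cfp_subject("Faculty position in databases")
```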

Overview of System Architecture

Approach

► Entity Names
    Entity attribute that represents the name of the entity
    Can contain more than one name

Approach

► Text-proximity Relationships
    Relationships that can be expected to be in text-proximity of the entity
    Nearness measured in character spaces

Approach

► Text Co-occurrence Relationships
    Similar to text-proximity relationships except proximity is not relevant

Approach

► Popular Entities
    The intuition behind this is to specify relationships that will bias the right entity to be the most popular entity
    This should be used with care, depending on the domain
    DBLP ex: the number of papers the entity has authored

Approach

► Semantic Relationships
    Entities can be related to one another through their collaboration network
    DBLP ex: Entities are related to one another through co-author relationships
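The five kinds of ontology information from the Approach slides could be declared in a small configuration object such as the one sketched here. All property names ("name", "affiliation", "co_author", and so on) and the window size are hypothetical placeholders, not identifiers from the actual DBLP ontology.

```python
from dataclasses import dataclass, field


@dataclass
class DisambiguationConfig:
    """Which ontology properties play which role (all names hypothetical)."""
    name_properties: list = field(default_factory=list)      # entity names
    text_proximity: list = field(default_factory=list)       # expected near the name
    text_cooccurrence: list = field(default_factory=list)    # expected anywhere in the text
    popular: list = field(default_factory=list)              # counted as a popularity signal
    semantic: list = field(default_factory=list)             # links between entities
    proximity_window: int = 100                               # nearness in characters (assumed value)


# One possible mapping for the DBLP examples given on the slides.
dblp_config = DisambiguationConfig(
    name_properties=["name", "alternate_name"],
    text_proximity=["affiliation"],
    text_cooccurrence=["area_of_interest"],
    popular=["authored"],
    semantic=["co_author"],
)
```

Keeping the roles in configuration rather than code is one way to reuse the same method with a different ontology.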

Algorithm

► Idea is to spot entity names in text and assign each potential match a confidence score
► This confidence score will be adjusted as the algorithm progresses and represents the certainty that this spotted entity represents a particular object in the ontology

Algorithm – Flow Chart

[Flow chart: Start; spot entity names; if found, initiate a confidence score and store it in Candidate Entities, otherwise do nothing; repeat while there are more entities. Then, for each candidate entity, spot text-proximity relationships; if found, adjust the confidence score, otherwise do nothing; repeat while there are more candidate entities.]

Algorithm – Flow Chart

[Flow chart, continued: for each candidate entity, spot text co-occurrence relationships; if found, adjust the confidence score, otherwise do nothing. Adjust the confidence score based on the number of popular entity relationships. Search for semantic relationships; if found, adjust the confidence score, otherwise no change; repeat while there are more candidate entities. If a candidate entity rises above the threshold, repeat (iterative step); otherwise end.]

Algorithm

► Spotting Entity Names
    Search document for entity names within the ontology
    Each of the entities in the ontology that matches a name found in the document becomes a candidate entity
    Assign initial confidence scores for candidate entities based on these formulas:
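A minimal sketch of the name-spotting step follows. The initial-score formulas are not reproduced in this text, so the sketch uses a constant placeholder score, which is purely an assumption; the `Candidate` record, the `ex:` URIs, and the sample sentence are also made up for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    uri: str       # ontology identifier of the candidate entity
    name: str      # the name string that matched
    offset: int    # character offset of the match in the document
    score: float   # confidence score, adjusted by later steps


def spot_entity_names(text, names_by_uri, initial_score=0.5):
    """Return one candidate per (entity, occurrence) pair.

    initial_score is a placeholder; the source's formulas are not shown here.
    """
    candidates = []
    for uri, names in names_by_uri.items():
        for name in names:
            for match in re.finditer(re.escape(name), text):
                candidates.append(Candidate(uri, name, match.start(), initial_score))
    return sorted(candidates, key=lambda c: (c.offset, c.uri))


# Two hypothetical ontology entities that share the same name.
ontology_names = {
    "ex:author/1": ["John Smith"],
    "ex:author/2": ["John Smith", "J. Smith"],
}
found = spot_entity_names("Program chair: John Smith (University of Georgia)", ontology_names)
```

Both entities become candidates for the same span, which is exactly the ambiguity the later steps try to resolve.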

Algorithm

► Spotting Literal Values of Text-proximity Relationships
    Only consider relationships from candidate entities
    Substantially increase confidence score if within proximity
    Ex: Entity affiliation found next to entity name
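The proximity check can be sketched as a window, measured in characters, around the spotted name. The window size and the size of the boost are assumed values chosen for illustration; the slides only say the increase is substantial.

```python
def boost_for_proximity(text, name_offset, name, literal, window=50, boost=0.3):
    """Return a boost if `literal` lies entirely within `window` characters of the name.

    window and boost are illustrative values, not taken from the source.
    """
    start = max(0, name_offset - window)
    end = min(len(text), name_offset + len(name) + window)
    return boost if literal in text[start:end] else 0.0


sentence = "John Smith (University of Georgia) will chair the session."
near = boost_for_proximity(sentence, 0, "John Smith", "University of Georgia")
too_far = boost_for_proximity(sentence, 0, "John Smith", "University of Georgia", window=5)
```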

Algorithm

► Spotting Literal Values of Text Co-occurrence Relationships
    Only consider relationships from candidate entities
    Increase confidence score if found within the document (location does not matter)
    Ex: Entity’s areas of interest found in the document
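For co-occurrence, position is ignored, so the check reduces to a document-wide search. In this sketch the per-match boost and the case-insensitive comparison are assumptions, and the topic strings are made up.

```python
def boost_for_cooccurrence(text, literals, boost_per_match=0.1):
    """Return (total boost, matched literals) for values found anywhere in the text.

    boost_per_match is an illustrative value, not taken from the source.
    """
    lowered = text.lower()
    matches = [lit for lit in literals if lit.lower() in lowered]
    return boost_per_match * len(matches), matches


post = "Topics include semantic web, data integration and query processing."
total, matched = boost_for_cooccurrence(post, ["Semantic Web", "Bioinformatics"])
```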

Algorithm

► Using Popular Entities
    Slightly increase the confidence score of candidate entities based on the amount of popular entity relationships
    Valuable when used as a tie-breaker
    Ex: Candidate entities with more than 15 publications receive a slight increase in their confidence score
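The tie-breaker from the example can be written as a simple cutoff. The 15-publication cutoff comes from the slide; the size of the boost is an assumed value.

```python
def popularity_boost(publication_count, min_publications=15, boost=0.05):
    """Small boost for candidates with more than `min_publications` papers.

    boost=0.05 is an illustrative value; the source only says "slight".
    """
    return boost if publication_count > min_publications else 0.0


prolific = popularity_boost(40)
borderline = popularity_boost(15)
```

Keeping this boost small relative to the text-based boosts lets it separate otherwise equal candidates without overriding evidence found in the document.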

Algorithm

► Using Semantic Relationships
    Use relationships among entities to boost confidence scores of candidate entities
    Each candidate entity with a confidence score above the threshold is analyzed for semantic relationships to other candidate entities. If another candidate entity is found and is below the threshold, that entity’s confidence score is increased

Algorithm

► If any candidate entity rises above the threshold, the process repeats until the algorithm stabilizes
► This is an iterative step and always converges
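The iterative step can be sketched as a loop that lets each newly confident candidate boost its related candidates once. The threshold, the boost, and the entity labels are assumed values for illustration. In this sketch termination follows because each candidate is processed at most once and the candidate set is finite; that argument applies to the sketch and is not a claim about the original implementation.

```python
def propagate_semantic(scores, related, threshold=0.8, boost=0.2):
    """Boost below-threshold candidates related to above-threshold ones, repeatedly.

    scores:  candidate id -> confidence score
    related: candidate id -> ids linked by a semantic relationship (e.g. co-author)
    threshold and boost are illustrative values, not taken from the source.
    """
    scores = dict(scores)   # work on a copy
    processed = set()       # confident candidates whose links were already used
    while True:
        newly_confident = [
            e for e, s in scores.items() if s > threshold and e not in processed
        ]
        if not newly_confident:
            return scores   # nothing rose above the threshold: stable
        for entity in newly_confident:
            processed.add(entity)
            for other in related.get(entity, ()):
                if other in scores and scores[other] < threshold:
                    scores[other] = min(1.0, scores[other] + boost)


initial = {"a": 0.9, "b": 0.7, "c": 0.65, "d": 0.3}
links = {"a": ["b"], "b": ["c"], "c": ["d"]}
final = propagate_semantic(initial, links)
```

Here "a" lifts "b" above the threshold, "b" then lifts "c", and "d" is boosted but stays below the threshold, at which point the loop stops.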

Output

► XML format
    URI – the DBLP URL of the entity
    Entity name
    Confidence score
    Character offset – the location of the entity in the document
► This is a generic output and can easily be converted for use in Microformats, RDFa, etc.
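An output record carrying the four fields listed above might be serialized as in the following sketch. The element names (`entities`, `entity`, `uri`, `name`, `confidence`, `offset`) and the example URI are assumptions; the actual schema is not given in this text.

```python
import xml.etree.ElementTree as ET


def to_xml(results):
    """Serialize disambiguation results; element names are hypothetical."""
    root = ET.Element("entities")
    for r in results:
        entity = ET.SubElement(root, "entity")
        ET.SubElement(entity, "uri").text = r["uri"]
        ET.SubElement(entity, "name").text = r["name"]
        ET.SubElement(entity, "confidence").text = f'{r["confidence"]:.2f}'
        ET.SubElement(entity, "offset").text = str(r["offset"])
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml([
    {
        "uri": "http://example.org/author/1",   # placeholder, not a real DBLP URL
        "name": "John Smith",
        "confidence": 0.9,
        "offset": 15,
    }
])
```

Because the character offset is kept, a later converter can wrap the exact span in the source document with Microformat or RDFa markup.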

Output

Output - Microformat

Evaluation: Gold Standard Set

► We evaluate our system using a gold standard set of documents
    20 manually disambiguated documents
    Randomly chose 20 consecutive posts from DBWorld
    We use precision and recall as the measurement of evaluation for our system

Evaluation: Gold Standard Set

Evaluation: Precision & Recall

► We define set A as the set of unique names identified using the disambiguated dataset
► We define set B as the set of entities found by our method
► The intersection of these sets represents the set of entities correctly identified by our method

Evaluation: Precision & Recall

► Precision is the proportion of correctly disambiguated entities with regard to B
► Recall is the proportion of correctly disambiguated entities with regard to A
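Written out with the sets A and B defined above, the two measures take the standard form:

```latex
\[
\text{Precision} = \frac{|A \cap B|}{|B|},
\qquad
\text{Recall} = \frac{|A \cap B|}{|A|}
\]
```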

Evaluation: Results

► Precision and recall when compared to entire gold standard set:

    Correct Disambiguation   Found Entities   Total Entities   Precision   Recall
    602                      620              758              97.1%       79.4%

► Precision and recall on a per document basis:

    [Figure: “Precision and Recall” chart; x-axis: Documents (1 to 20); y-axis: Percentage (0 to 100); series: Recall, Precision]
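As a quick check, the reported percentages can be recomputed from the three counts in the results table. The function name is hypothetical; the numbers are those reported for the entire gold standard set.

```python
def precision_recall(correct, found, total):
    """Precision = correct / found; recall = correct / total."""
    return correct / found, correct / total


precision, recall = precision_recall(correct=602, found=620, total=758)
print(f"precision={precision:.1%} recall={recall:.1%}")
# prints "precision=97.1% recall=79.4%"
```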

Related Work

► Semex:
    Personal information management system that works with a user’s desktop
    Takes advantage of a predictable structure
    The results of disambiguated entities are propagated to other ambiguous entities, which could then be reconciled based on recently reconciled entities much like our work does

Related Work

► Kim:
    An application that aims at automatic ontology population
    Contains an entity recognition portion that uses natural language processors
    Evaluations performed on human annotated corpora
    Missed a lot of entities and results had many false positives

Conclusion

► Our method uses relationships between entities in the ontology to go beyond traditional syntactic-based disambiguation techniques
► This work is among the first to successfully use relationships for identifying entities in text without relying on the structure of the text

Thank you!
