A Graph-based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields
Yotaro Watanabe, Masayuki Asahara and Yuji Matsumoto
Nara Institute of Science and Technology
EMNLP-CoNLL 2007 29th June
Prague, Czech
Background
Named Entity
Proper nouns (e.g. Shinzo Abe (Person), Prague (Location)), time/date expressions (e.g. June 29 (Date)) and numerical expressions (e.g. 10%)
In many NLP applications (e.g. IE, QA), Named Entities play an important role
Named Entity Recognition task (NER)
Treated as sequential tagging problem
Machine learning methods have been proposed
Recall is usually low
A large-scale NE dictionary is useful for NER
Semi-automatic methods to compile NE dictionaries are in demand
Resource for NE dictionary construction
Wikipedia
Multi-lingual encyclopedia on the Web
382,613 gloss articles (as of June 20, 2007, Japanese)
Gloss indices are composed of nouns or proper nouns
HTML (Semi-structured text)
Lists(<LI>) and Tables(<TABLE>) can be used as clues for NE type categorization
Linked articles are glossed by anchor texts in articles
Each article has one or more categories
Wikipedia has useful information for NE categorization
Can be considered as a suitable resource
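As a minimal sketch of how list anchors could be pulled out of such semi-structured HTML, the following uses Python's standard `html.parser`. The class name `ListAnchorParser`, the helper `extract_list_anchors` and the `SAMPLE` markup are illustrative assumptions, not part of the original system, and the sketch assumes well-formed HTML with closed `<li>` tags.

```python
from html.parser import HTMLParser


class ListAnchorParser(HTMLParser):
    """Collect anchor texts that appear inside <li> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0        # number of <li> elements currently open
        self.in_anchor = False   # True while inside an <a> within a list item
        self.anchors = []        # (list nesting depth, anchor text) pairs
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_depth += 1
        elif tag == "a" and self.li_depth > 0:
            self.in_anchor = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth > 0:
            self.li_depth -= 1
        elif tag == "a" and self.in_anchor:
            self.anchors.append((self.li_depth, "".join(self._buf).strip()))
            self.in_anchor = False

    def handle_data(self, data):
        if self.in_anchor:
            self._buf.append(data)


def extract_list_anchors(html):
    """Return (depth, text) for every anchor found inside a list item."""
    parser = ListAnchorParser()
    parser.feed(html)
    return parser.anchors


# Illustrative markup only; real Wikipedia pages are far noisier.
SAMPLE = (
    "<ul>"
    "<li><a href='#'>Dillard &amp; Clark</a> <a href='#'>country rock</a></li>"
    "<li><a href='#'>Carpenters</a>"
    "<ul><li><a href='#'>Karen Carpenter</a></li></ul></li>"
    "</ul>"
)

print(extract_list_anchors(SAMPLE))
```

The depth value records how deeply nested the enclosing list item is, which is the kind of structural clue the slides propose to exploit.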
Objective
Extract Named Entities by assigning proper NE labels for gloss indices of Wikipedia
[Figure: Wikipedia gloss indices labeled with NE classes (Person, Product, Location, Organization, Natural Object)]
Use of Wikipedia features
Features of Wikipedia articles
Anchors of an article refer to the other related articles
Anchors in list elements have dependencies on each other
=> Make 3 assumptions about dependencies between anchors
An example of a list structure (NE labels as shown in the figure):
Burt Bacharach (PERSON) … composer (VOCATION)
Dillard & Clark (ORGANIZATION)
Carpenters (ORGANIZATION)
Karen Carpenter (PERSON)
Assumption 1: The latter element in a list item tends to be in an attribute relation to the former element
Assumption 2: The elements in the same itemization tend to be in the same NE category
Assumption 3: The nested element tends to be in a part-of relation to the upper element
Overview of our approach
Focus on HTML list structure in Wikipedia
Make 3 assumptions about dependencies between anchors
Formalize the NE categorization problem as assigning NE class labels to anchors in lists
Define 3 kinds of cliques (edges: Sibling, Cousin and Relative) between anchors based on the 3 assumptions
Construct graphs based on 3 defined cliques
CRFs for NE categorization in Wikipedia
Define potential functions over 3 edges (and nodes) to provide conditional distribution over the graphs
Estimate MAP label assignment over the graphs using Conditional Random Fields
Conditional Random Fields (CRFs)
Conditional Random Fields [Lafferty 2001]
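Discriminative, undirected models
Define conditional distribution p(y|x)
Features: arbitrary features can be used
Globally optimize on all possible label assignments
Can deal with label dependencies by defining potential functions for cliques (2 or more nodes)
\[ p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{c \in C} \sum_{k} \lambda_k f_k(\mathbf{y}_c, \mathbf{x}_c) \Big) \]
Z(x): partition function, C: cliques, f_k: feature function, \lambda_k: model parameter
[Figure: graphical model with label nodes y1, y2, y3, ..., yn conditioned on the observation x]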
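To make the CRF definition concrete, here is a minimal sketch that evaluates p(y|x) on a two-node toy graph by enumerating every labeling. The label set, the feature functions `f_same_label` and `f_ampersand_org` and the weights are all hypothetical; exhaustive enumeration is only feasible for tiny graphs and is used here purely for illustration.

```python
import itertools
import math

LABELS = ["PERSON", "ORGANIZATION"]  # toy label set (assumption)


def clique_score(weights, features, y_c, x_c):
    """sum_k lambda_k * f_k(y_c, x_c) for a single clique."""
    return sum(w * f(y_c, x_c) for w, f in zip(weights, features))


def crf_distribution(cliques, x, weights, features, n_nodes):
    """Return {labeling: p(y|x)} by brute-force enumeration of all labelings."""
    scores = {}
    for y in itertools.product(LABELS, repeat=n_nodes):
        total = sum(
            clique_score(
                weights,
                features,
                tuple(y[i] for i in c),
                tuple(x[i] for i in c),
            )
            for c in cliques
        )
        scores[y] = math.exp(total)
    z = sum(scores.values())  # partition function Z(x)
    return {y: s / z for y, s in scores.items()}


# Hypothetical feature functions, invented for this sketch.
def f_same_label(y_c, x_c):
    """Fires on a two-node clique whose labels agree."""
    return 1.0 if len(y_c) == 2 and y_c[0] == y_c[1] else 0.0


def f_ampersand_org(y_c, x_c):
    """Fires on a single node whose text contains '&' and is labeled ORGANIZATION."""
    return 1.0 if len(y_c) == 1 and "&" in x_c[0] and y_c[0] == "ORGANIZATION" else 0.0


x = ["Dillard & Clark", "Carpenters"]
cliques = [(0,), (1,), (0, 1)]  # two node cliques and one edge clique
dist = crf_distribution(cliques, x, [1.0, 2.0], [f_same_label, f_ampersand_org], 2)
best = max(dist, key=dist.get)
print(best, round(dist[best], 3))
```

Because the edge feature rewards agreeing labels, evidence on the first anchor also raises the probability that the second anchor is an ORGANIZATION, which is the label dependency the slide refers to.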
Use of dependencies for categorization
Treat the NE categorization problem as assigning class labels to anchors
Each edge of the constructed graphs corresponds to a particular dependency
Estimate MAP label assignment over the constructed graphs using Conditional Random Fields
Our formulation: Can extract anchors without gloss articles
[Figure: the list "Dillard & Clark … country rock / Carpenters / Karen Carpenter", with each anchor marked as "article exists" or "article does not exist"]
Clique definition based on HTML tree structure
[Figure: HTML tree (<UL>, <LI>, <A>) of the list below, with Sibling, Cousin and Relative edges drawn between anchors]
Dillard & Clark … country rock
Carpenters
Karen Carpenter
Sibling: The latter element tends to be an attribute or a concept of the former element
Cousin: The elements tend to have a common NE category (e.g. ORGANIZATION)
Relative: The latter element tends to be a constituent part of the former element
Use these 3 relations as cliques of CRFs
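The three clique types can be sketched as an edge-building pass over a nested list. This is a simplified reading, not the exact definition used in the work: the item dictionaries, the function `build_edges`, and the rules "Cousin = same anchor position in the nearest following item of the same list" and "Relative = first anchor of the parent item to first anchor of a nested item" are assumptions made for illustration.

```python
def build_edges(items, parent_anchor=None, edges=None):
    """Return (kind, source, target) edges for a nested list of items.

    Each item is {"anchors": [...], "children": [...]} (hypothetical structure).
    """
    if edges is None:
        edges = []
    for idx, item in enumerate(items):
        anchors = item["anchors"]
        # Sibling: consecutive anchors inside the same list item.
        for a, b in zip(anchors, anchors[1:]):
            edges.append(("S", a, b))
        # Relative: anchor of the enclosing item to the nested item's first anchor.
        if parent_anchor is not None and anchors:
            edges.append(("R", parent_anchor, anchors[0]))
        # Cousin: same anchor position in the nearest following item of this list.
        for pos, a in enumerate(anchors):
            for later in items[idx + 1:]:
                if len(later["anchors"]) > pos:
                    edges.append(("C", a, later["anchors"][pos]))
                    break
        if item.get("children"):
            build_edges(item["children"], anchors[0] if anchors else None, edges)
    return edges


example = [
    {"anchors": ["Dillard & Clark", "country rock"], "children": []},
    {
        "anchors": ["Carpenters"],
        "children": [{"anchors": ["Karen Carpenter"], "children": []}],
    },
]

for edge in build_edges(example):
    print(edge)
```

On this example the sketch yields one edge of each kind, matching the three relations illustrated on the slide.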
A graph constructed from 3 clique definitions
[Figure: graph over the anchors of the list below, with edges labeled S (Sibling), C (Cousin) and R (Relative)]
Burt Bacharach … "On my own" … 1986
Dillard & Clark
Gene Clark
Carpenters … "As Time Goes By" … 2000
Karen Carpenter
Estimate the MAP label assignment over the graph
Sibling: The latter element tends to be an attribute or a concept of the former element
Cousin: The elements tend to have a common attribute (e.g. ORGANIZATION)
Relative: The latter element tends to be a constituent part of the former element
Model
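\[ p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{(v_i, v_j) \in E_S, E_C, E_R} \Phi_{SCR}(y_i, y_j) \prod_{v_i \in V} \Phi_V(y_i, \mathbf{x}) \]
\[ \Phi_{SCR}(y_i, y_j) = \exp\Big( \sum_{k} \lambda_k f_k(y_i, y_j) \Big) \]
\[ \Phi_V(y_i, \mathbf{x}) = \exp\Big( \sum_{k'} \lambda_{k'} f_{k'}(y_i, \mathbf{x}) \Big) \]
\Phi_V : Potential function for nodes
\Phi_{SCR} : Potential function for Sibling, Cousin and Relative cliques
[Figure: example graph with Sibling (S), Cousin (C) and Relative (R) edges]
Constructed graphs include cycles: exact inference is computationally expensive
-> Introduce Tree-based Reparameterization (TRP) [Wainwright 2003] for approximate inference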
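The model can be illustrated on a toy graph that contains a cycle. The sketch below finds the MAP labeling by exhaustive enumeration, which stands in for TRP here only because the graph is tiny; it is not the inference method named on the slide. The graph, the label set and the potential values in `node_pot` and `edge_pot` are hypothetical.

```python
import itertools

LABELS = ("PERSON", "ORGANIZATION")  # toy label set (assumption)


def log_score(y, nodes, edges, node_pot, edge_pot):
    """log of prod Phi_SCR(y_i, y_j) * prod Phi_V(y_i, x), omitting -log Z(x)."""
    score = sum(node_pot(nodes[i], y[i]) for i in range(len(nodes)))
    score += sum(edge_pot(kind, y[i], y[j]) for kind, i, j in edges)
    return score


def map_assignment(nodes, edges, node_pot, edge_pot):
    """Brute-force MAP: try every labeling and keep the best-scoring one."""
    return max(
        itertools.product(LABELS, repeat=len(nodes)),
        key=lambda y: log_score(y, nodes, edges, node_pot, edge_pot),
    )


# Hypothetical log-potentials, chosen by hand for this sketch.
def node_pot(text, label):
    return 1.0 if "&" in text and label == "ORGANIZATION" else 0.0


def edge_pot(kind, y_i, y_j):
    if kind == "C" and y_i == y_j:
        return 1.0  # cousins tend to share a category
    if kind == "R" and (y_i, y_j) == ("ORGANIZATION", "PERSON"):
        return 1.0  # nested element as a member of the upper element
    return 0.0


nodes = ["Dillard & Clark", "Carpenters", "Karen Carpenter", "Gene Clark"]
# Toy edge set forming the cycle 0-1-2-3-0.
edges = [("C", 0, 1), ("R", 1, 2), ("R", 0, 3), ("C", 3, 2)]

print(map_assignment(nodes, edges, node_pot, edge_pot))
```

Only the first anchor has direct node evidence, yet the edge potentials propagate it around the cycle so that all four anchors receive a label.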
Experiments
The aims of experiments are:
1. Compare graph-based approach (relational) to node-wise approach (independent) to investigate how the relational classification improves classification accuracy
2. Investigate the effect of defined cliques
3. Compare CRFs models to baseline models based on SVMs
4. Show the effectiveness of using marginal probability for filtering NE candidates.
Dataset
Randomly sampled 2300 articles (Japanese version as of October 2005)
Anchors in list elements(<LI>) are hand-annotated with NE class label
We used Extended Named Entity Hierarchy (Sekine et al. 2002)
We reduced the number of classes to 13 from the original 200+ in order to avoid data sparseness
Classification targets: 16136 (14285 of those are NEs)
NE Class # of articles
EVENT 121
PERSON 3315
UNIT 15
LOCATION 1480
FACILITY 2449
TITLE 42
ORGANIZATION 991
VOCATION 303
NATURAL_OBJECT 1132
PRODUCT 1664
NAME_OTHER 24
TIMEX/NUMEX 2749
OTHER 1851
ALL 16136
Experiments (CRFs)
[Figure: graph structures of the 8 models (SCR, SC, SR, CR, S, C, R, I); each model keeps only the clique types in its name, and the I model has no edges]
To investigate which clique type contributes to classification accuracy:
We construct models that consist of the possible combinations of the defined cliques
8 models (SCR, SC, SR, CR, S, C, R, I)
Classification is performed on each connected subgraph
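Splitting the anchor graph into connected subgraphs can be sketched with a plain depth-first search. The function name `connected_subgraphs` and the integer node indexing are assumptions for illustration.

```python
from collections import defaultdict


def connected_subgraphs(n_nodes, edges):
    """Group node indices 0..n_nodes-1 into connected components."""
    adjacency = defaultdict(set)
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)

    seen = set()
    components = []
    for start in range(n_nodes):
        if start in seen:
            continue
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            v = stack.pop()
            component.append(v)
            for w in adjacency[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(sorted(component))
    return components


print(connected_subgraphs(5, [(0, 1), (1, 2), (3, 4)]))
print(connected_subgraphs(3, []))
```

With no edges at all, as in the I model, every anchor forms its own subgraph and is classified independently.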
Experimental settings (Baseline), Evaluation
Baseline: Support Vector Machines (SVMs) [Vapnik 1998]
We evaluate two models:
I model: each anchor text is classified independently
P model: anchor texts are ordered by their linear position in the HTML and classified with history-based classification (the (j-1)-th classification result is used in the j-th classification)
For multi-class classification: one-versus-rest
Evaluation
5-fold cross validation, by F1-value
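The history-based P model can be sketched as a left-to-right loop in which each decision sees the previously predicted label. The rule-based `toy_classifier` below is a hypothetical stand-in for a trained one-versus-rest SVM, and the feature names are invented for this example.

```python
def classify_history_based(anchors, classify):
    """Label anchors in order; the (j-1)-th prediction is a feature for the j-th."""
    labels = []
    prev = "BOS"  # placeholder "previous label" for the first anchor
    for text in anchors:
        features = {"text": text, "prev_label": prev}
        prev = classify(features)
        labels.append(prev)
    return labels


def toy_classifier(features):
    """Hypothetical stand-in for a trained classifier."""
    if "&" in features["text"]:
        return "ORGANIZATION"
    if features["prev_label"] == "ORGANIZATION":
        return "ORGANIZATION"
    return "OTHER"


print(classify_history_based(
    ["Burt Bacharach", "Dillard & Clark", "Carpenters"], toy_classifier
))
```

Unlike the graph-based models, information here flows only forward along the linear order, so an early mistake can propagate to later anchors.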
Results (F1-value)
                   CRFs                                                    SVMs
            #      SCR    SC     SR     CR     S      C      R      I      P      I
ALL         14285  .7854  .7855  .7822  .7862  .7817  .7845  .7813  .7805  .7798  .7790
no article  3898   .5465  .5484  .5223  .5495  .5271  .5475  .5273  .5249  .5386  .5278
ALL : whole dataset , no article : anchors without articles
Results (F1-value)
1. Graph-based vs. Node-wise
Performed McNemar paired test on labeling disagreements
=> difference was significant (p < 0.01)
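A McNemar test on labeling disagreements can be sketched as follows. The slides do not say which variant was used, so this example assumes the exact (binomial) two-sided form; the function name and the toy predictions are illustrative only.

```python
from math import comb


def mcnemar_exact(gold, pred_a, pred_b):
    """Exact two-sided McNemar test on the cases where only one system is correct."""
    only_a = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a == g and b != g)
    only_b = sum(1 for g, a, b in zip(gold, pred_a, pred_b) if a != g and b == g)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0  # the systems never disagree in correctness
    k = min(only_a, only_b)
    # Binomial tail under the null hypothesis that either outcome is equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


gold = ["PERSON"] * 10
pred_a = ["PERSON"] * 10               # system A: all correct
pred_b = ["OTHER"] * 8 + ["PERSON"] * 2  # system B: 8 errors

print(mcnemar_exact(gold, pred_a, pred_b))
```

Only the items on which exactly one system is correct enter the test, so two systems with similar overall F1 can still differ significantly.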
Results (F1-value)
2. Which clique contributes most? => Cousin clique
Cousin cliques provided the largest accuracy improvement compared to sibling and relative cliques
Results (F1-value)
3. CRFs vs. SVMs
Significance test: McNemar paired test on labeling disagreements
Filtering NE candidates using marginal probability
Construct dictionaries from extracted NE candidates
Methods with lower cost are desirable
Extract only confident NE candidates
-> Use the marginal probability provided by CRFs
Marginal probability
probability of a particular label assignment for a node
This can be regarded as the “confidence” of a classifier
\[ p(y_i \mid \mathbf{x}) = \sum_{\mathbf{y} \setminus y_i} p(\mathbf{y} \mid \mathbf{x}) \]
[Figure: node v_i \in V with label y_i]
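The marginal can be illustrated by summing a joint distribution over all other labels and then keeping only confident candidates. The joint table below is a made-up toy distribution, and `node_marginals` and `confident_candidates` are hypothetical helper names; on real graphs the marginals would come from approximate inference rather than from an explicit table.

```python
def node_marginals(joint, n_nodes):
    """p(y_i | x) for each node, obtained by summing p(y | x) over the other labels."""
    marginals = [dict() for _ in range(n_nodes)]
    for labeling, prob in joint.items():
        for i, label in enumerate(labeling):
            marginals[i][label] = marginals[i].get(label, 0.0) + prob
    return marginals


def confident_candidates(anchors, map_labels, marginals, threshold):
    """Keep (anchor, label) pairs whose MAP label has marginal >= threshold."""
    return [
        (anchor, label)
        for anchor, label, marginal in zip(anchors, map_labels, marginals)
        if marginal[label] >= threshold
    ]


# Toy joint distribution over two anchors (values invented for illustration).
joint = {
    ("ORGANIZATION", "PERSON"): 0.6,
    ("ORGANIZATION", "ORGANIZATION"): 0.3,
    ("PERSON", "PERSON"): 0.1,
}
anchors = ["Carpenters", "Karen Carpenter"]
map_labels = max(joint, key=joint.get)
marginals = node_marginals(joint, 2)

print(confident_candidates(anchors, map_labels, marginals, threshold=0.8))
```

Here the first anchor's label has marginal 0.9 and is kept, while the second has 0.7 and is filtered out at a threshold of 0.8.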
Precision-Recall Curve
Precision-Recall curve obtained by thresholding the marginal probability of the MAP estimation in the CR model of CRFs
At this point, recall value is about 0.57 and precision value is about 0.97
With proper thresholding of the marginal probability, an NE dictionary can be constructed at lower cost
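A precision-recall trade-off of this kind can be sketched by sweeping a threshold over confidence scores. The scores below are invented, and defining recall against a given number of gold NEs (`n_gold`) is an assumption of this sketch.

```python
def precision_recall_at(threshold, scored, n_gold):
    """scored: (marginal probability of the MAP label, label is correct) pairs."""
    kept = [correct for confidence, correct in scored if confidence >= threshold]
    true_positives = sum(kept)
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / n_gold if n_gold else 0.0
    return precision, recall


# Invented scores for six candidates, all of which are gold NEs.
scored = [
    (0.99, True),
    (0.95, True),
    (0.80, True),
    (0.60, False),
    (0.55, True),
    (0.40, False),
]

for threshold in (0.9, 0.5):
    print(threshold, precision_recall_at(threshold, scored, n_gold=6))
```

Raising the threshold trades recall for precision, which is why a high threshold suits dictionary construction where manual checking is costly.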
Summary and future work
Summary
Proposed a method for categorizing NEs in Wikipedia
Defined 3 kinds of cliques (Sibling, Cousin and Relative) over HTML tree
Graph-based model achieved significant improvements compared to the node-wise model and the baseline methods (SVMs)
NEs can be extracted with lower cost by exploiting marginal probability
Summary and Future work
Future work
Use fine-grained NE classes
For many NLP applications (e.g. QA, IE), an NE dictionary with fine-grained label sets will be a useful resource
Classification with statistical methods becomes difficult when the label set is large, because positive examples are insufficient
Incorporate hierarchical structure of label sets into our models (Hierarchical Classification)
Previous work suggests that exploiting the hierarchical structure of label sets improves classification accuracy