Effective Named Entity Recognition for Idiosyncratic Web Collections
Roman Prokofyev, Gianluca Demartini, Philippe Cudre-Mauroux
eXascale Infolab, University of Fribourg, Switzerland
WWW 2014, April 10, 2014
Outline
• Introduction
• Problem definition
• Existing approaches and applicability
• Overview
• Candidate named entities selection
• Dataset description
• Features description
• Experimental setup & evaluation
Problem Definition
Example entities to recognize:
• search engine
• web search engine
• navigational query
• user intent
• information need
• web content
• …

Entity type: scientific concept
Traditional NER
Types:
• Maximum Entropy (Mallet, NLTK)
• Conditional Random Fields (Stanford NER, Mallet)

Properties:
• Require extensive training
• Usually domain-specific: different collections require training on their own domain
• Very good at detecting types such as Location, Person, and Organization
Proposed Approach
We define our problem as a classification task.

Two-step classification:
• Extract candidate named entities using a frequency-based filtering algorithm.
• Classify the candidate named entities with a supervised classifier.

Candidate selection should greatly reduce the number of n-grams to classify, ideally without a significant loss in recall.
Pipeline
Candidate Selection: Part I
Consider all bigrams with frequency > k (here k = 2):

candidate named: 5
entity are: 4
entity candidate: 3
entity in: 18
entity recognition: 12
named entity: 101
of named: 10
that named: 3
the named: 4

After the NLTK stop word filter:

candidate named: 5
entity candidate: 3
entity recognition: 12
named entity: 101
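A minimal sketch of this first filtering step in Python, assuming NLTK for tokenization and its English stop word list; the threshold k = 2 follows the example above, and the exact tokenization details are assumptions:

# Sketch of candidate selection, part I: keep bigrams with frequency > k
# whose words are not stop words.
from collections import Counter
from nltk import word_tokenize, bigrams
from nltk.corpus import stopwords

STOP = set(stopwords.words('english'))

def candidate_bigrams(text, k=2):
    tokens = [t.lower() for t in word_tokenize(text)]
    counts = Counter(bigrams(tokens))
    return {bg: c for bg, c in counts.items()
            if c > k and not any(w in STOP for w in bg)}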
Candidate Selection: Part II
Adjacent bigrams sharing a word are merged into trigrams; trigram frequencies are looked up in the n-gram index.

Bigrams from Part I:
candidate named: 5
entity candidate: 3
entity recognition: 12
named entity: 101

Merged trigram candidates, with frequencies from the n-gram index:
candidate named entity: 5
named entity candidate: 3
named entity recognition: 12

After subtracting trigram counts from their component bigrams:
candidate named entity: 5
named entity candidate: 3
named entity recognition: 12
named entity: 81
candidate named: 0
entity candidate: 0
entity recognition: 0
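A sketch of the merging step under the same assumptions; `ngram_index` stands for a hypothetical dict-like frequency index over the collection's n-grams, built offline:

# Sketch of candidate selection, part II: adjacent bigrams sharing a word are
# joined into trigrams, trigram frequencies come from the n-gram index, and
# each trigram's count is subtracted from its component bigrams.
def merge_into_trigrams(bigram_freqs, ngram_index):
    result = dict(bigram_freqs)
    for (a, b) in bigram_freqs:
        for (c, d) in bigram_freqs:
            if b == c:                               # e.g. "candidate named" + "named entity"
                trigram = (a, b, d)
                freq = ngram_index.get(trigram, 0)   # frequency from the index
                if freq > 0:
                    result[trigram] = freq
                    result[(a, b)] = max(result[(a, b)] - freq, 0)
                    result[(b, d)] = max(result[(b, d)] - freq, 0)
    return result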
Candidate Selection: Discussion
The merging step makes it possible to extract n-grams (n > 2) whose own frequency is ≤ k, even though every component bigram had to occur more than k times.
After Candidate Selection
TwiNER: Named Entity Recognition in Targeted Twitter Stream (SIGIR 2012)
Classifier: Overview
Machine learning algorithm: Decision Trees from the scikit-learn package.

Feature types:
• POS tags and their derivatives
• External knowledge bases (DBLP, DBpedia)
• DBpedia relation graphs
• Syntactic features
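A minimal sketch of the classification step with scikit-learn, assuming each candidate n-gram has already been converted into a numeric feature vector (`X_train`, `y_train`, and `X_candidates` are hypothetical names):

# Sketch: train a decision tree on the judged candidates, then label the rest.
from sklearn.tree import DecisionTreeClassifier

def classify_candidates(X_train, y_train, X_candidates):
    # y_train: 1 = valid entity, 0 = invalid n-gram
    clf = DecisionTreeClassifier(random_state=0)  # default parameters; tuning not shown
    clf.fit(X_train, y_train)
    return clf.predict(X_candidates)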
Datasets
Two collections:
• CS collection (SIGIR 2012 Research Track): 100 papers
• Physics collection: 100 papers randomly selected from the arXiv.org High Energy Physics category

                      CS Collection   Physics Collection
# Candidate n-grams   21 531          18 129
# Judged n-grams      15 057          11 421
# Valid entities      8 145           5 747
# Invalid n-grams     6 912           5 674

Available at: github.com/XI-lab/scientific_NER_dataset
Features: POS Tags, part I
100+ different tag patterns
Features: POS Tags, part II
Two feature schemes:
• Raw POS tag patterns: each full tag pattern is a binary feature
• Regex POS tag patterns:
  • First-tag match, e.g. "JJ*" covers "JJ NNS", "JJ NN NN", "JJ NN", ...
  • Last-tag match, e.g. "*VB" covers "NN VB", "NN NN VB", "JJ NN VB", ...
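A sketch of both schemes, assuming each candidate n-gram is represented by its sequence of Penn Treebank POS tags (e.g. obtained with nltk.pos_tag); the STARTS/ENDS feature names mirror those used later in the feature-importance table:

def pos_pattern_features(tags):
    # tags: e.g. ["JJ", "NN", "VB"]
    features = {}
    features["POS=" + " ".join(tags)] = 1  # raw scheme: whole pattern as one binary feature
    features[tags[0] + " STARTS"] = 1      # regex scheme, first-tag match ("JJ*")
    features[tags[-1] + " ENDS"] = 1       # regex scheme, last-tag match ("*VB")
    return features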
Features: External Knowledge Bases
Domain-specific knowledge bases:
• DBLP (Computer Science): contains author-assigned keywords for papers
• ScienceWISE: high-quality scientific concepts (mostly for the Physics domain), http://sciencewise.info

We perform exact string matching against these KBs.
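A sketch of these features, assuming the DBLP keywords and ScienceWISE concepts have been loaded into lower-cased string sets:

def kb_features(ngram, dblp_keywords, sciencewise_concepts):
    text = " ".join(ngram).lower()
    return {
        "DBLP": int(text in dblp_keywords),                # exact string match
        "ScienceWISE": int(text in sciencewise_concepts),  # exact string match
    }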
Features: DBpedia, part I
DBpedia pages essentially represent valid entities.
But there are problems when:
• the n-gram is not an entity at all
• the n-gram is not a scientific concept ("Tom Cruise" in an IR paper)

                          CS Collection           Physics Collection
                          Precision   Recall      Precision   Recall
Exact string matching     0.9045      0.2394      0.7063      0.0155
Matching with redirects   0.8457      0.4229      0.7768      0.5843
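A sketch of the two matching strategies, assuming `dbpedia_titles` is a set of DBpedia page titles and `redirects` maps redirect titles to their target pages (both hypothetical structures built offline from DBpedia dumps):

def dbpedia_match(ngram, dbpedia_titles, redirects, use_redirects=True):
    title = " ".join(ngram).title()  # naive normalization to page-title casing (assumption)
    if title in dbpedia_titles:
        return title                 # exact string match
    if use_redirects:
        return redirects.get(title)  # follow a redirect to its target page, if any
    return None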
Features: DBpedia, part II
[Figure: DBpedia relation graphs, without redirects vs. with redirects]
Features: Syntactic
Set of common syntactic features:
• N-gram length in words
• Whether the n-gram is uppercased
• The number of other n-grams the given n-gram is a part of
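A sketch of these features; `all_candidates` stands for the set of all candidate n-grams extracted from the collection, and the last feature corresponds to the participation count in the importance table:

def syntactic_features(ngram, all_candidates):
    text = " ".join(ngram)
    # naive substring containment to count the n-grams this one participates in
    participation = sum(1 for other in all_candidates
                        if other != ngram and text in " ".join(other))
    return {
        "length": len(ngram),               # n-gram length in words
        "uppercased": int(text.isupper()),  # e.g. acronyms
        "participation_count": participation,
    }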
Experiments: Overview
1. Regex POS patterns vs. normal POS tags
2. Redirects vs. non-redirects
3. Feature importance scores
4. Comparison with Maximum Entropy

All results are averaged over 10-fold cross-validation.
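A sketch of the evaluation setup with scikit-learn, assuming `X` and `y` hold the feature vectors and labels of the judged n-grams (metric names follow scikit-learn's built-in scorers):

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate

def evaluate(X, y):
    clf = DecisionTreeClassifier(random_state=0)
    scores = cross_validate(clf, X, y, cv=10,
                            scoring=["precision", "recall", "f1", "accuracy"])
    # report each metric averaged over the 10 folds
    return {name: values.mean() for name, values in scores.items()
            if name.startswith("test_")}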
Experiments: Comparison I

CS Collection                       Precision   Recall    F1 score   Accuracy   # features
Normal POS + Components             0.8794      0.8058*   0.8409*    0.8429*    54
Regex POS + Components              0.8475*     0.8524*   0.8499*    0.8448*    9
Normal POS + Components-Redirects   0.8678*     0.8305*   0.8487*    0.8473     50
Regex POS + Components-Redirects    0.8406*     0.8769    0.8584     0.8509     7

The symbol * indicates a statistically significant difference as compared to the approach in bold.
Experiments: Comparison II

Physics Collection                  Precision   Recall    F1 score   Accuracy   # features
Normal POS + Components             0.8253*     0.6567*   0.7311*    0.7567     53
Regex POS + Components              0.7941*     0.6781    0.7315*    0.7492*    4
Normal POS + Components-Redirects   0.8339      0.6674*   0.7412     0.7653     50
Regex POS + Components-Redirects    0.8375      0.6479*   0.7305*    0.7592*    6

The symbol * indicates a statistically significant difference as compared to the approach in bold.
Experiments: Feature Importance

CS Collection (7 features)   Importance
NN STARTS                    0.3091
DBLP                         0.1442
Components + DBLP            0.1125
Components                   0.0789
VB ENDS                      0.0386
NN ENDS                      0.0380
JJ STARTS                    0.0364

Physics Collection (6 features)   Importance
ScienceWISE                       0.2870
Components + ScienceWISE          0.1948
Wikipedia redirect                0.1104
Components                        0.1093
Wikilinks                         0.0439
Participation count               0.0370
Experiments: MaxEntropy

                    Precision   Recall    F1 score
Maximum Entropy     0.6566      0.7196    0.6867
Decision Trees      0.8121      0.8742    0.8420

The MaxEnt classifier receives the full text as input (we used the classifier from the NLTK package).
Comparison experiment: 80% of the CS collection as training data, 20% as test data.
Lessons Learned
• Classic NER approaches are not good enough for idiosyncratic web collections.
• Leveraging the graph of scientific concepts is a key feature.
• Domain-specific KBs and POS patterns work well.
• Experimental results show up to 85% accuracy over different scientific collections.

http://iner.exascale.info/
eXascale Infolab, http://exascale.info