Named-Entity Recognition with Character-Level Models
Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D. Manning
Stanford University
CoNLL-2003: Seventh Conference on Natural Language Learning
Unknown Words are a Central Challenge for NER
Recognizing known named entities (NEs) is relatively simple and accurate
Recognizing novel NEs requires recognizing context and/or word-internal features
External context and frequent internal words (e.g. "Inc.") are the most commonly used features
The internal composition of NEs alone provides surprisingly strong evidence for classification (Smarr & Manning, 2002), e.g. "Staffordshire", "Abdul-Karim al-Kabariti", "CentrInvest"
Are Names Self-Describing?
NO: names can be opaque/ambiguous
Word-Level: "Washington" occurs as LOC, PER, and ORG
Char-Level: "-ville" suggests LOC, but there are exceptions like "Neville"
YES: names can be highly distinctive/descriptive
Word-Level: "National Bank" is a bank (i.e. ORG)
Char-Level: "Cotramoxazole" is clearly a drug name
Question: Overall, how informative are names alone?
How Internally Descriptive are Isolated Named Entities?
Classification accuracy of pre-segmented CoNLL NEs without context is ~90%
Using character n-grams as features instead of words yields a 25% error reduction
On single-word unknown NEs, the word model is at chance; the char n-gram model fixes 38% of its errors (see the arithmetic sketch below)
[Chart: NE classification accuracy (%), not the CoNLL task. All NEs: Words 89.1 vs. Char N-Grams 91.8. Single-word unknowns: Words 37.5 vs. Char N-Grams 60.7]
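For concreteness, the error-reduction figures above follow from the chart's accuracies; a quick sketch of the arithmetic (the helper name is mine, not from the paper):

```python
def relative_error_reduction(acc_old, acc_new):
    """Fraction of the old model's errors eliminated by the new model (accuracies in %)."""
    return ((100.0 - acc_old) - (100.0 - acc_new)) / (100.0 - acc_old)

# All NEs: words 89.1 -> char n-grams 91.8
print(relative_error_reduction(89.1, 91.8))   # ~0.25
# Single-word unknown NEs: 37.5 -> 60.7
print(relative_error_reduction(37.5, 60.7))   # ~0.37, roughly the 38% quoted above
```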
Exploiting Word-Internal Features
Many existing systems use some word-internal features (suffix, capitalization, punctuation, etc.)
e.g. Mikheev 97, Wacholder et al 97, Bikel et al 97
Features usually language-dependent (e.g. morphology)
Our approach: use char n-grams as the primary representation
Use all substrings as classification features:
Char n-grams subsume word features
Features are language-independent (assuming the language is alphabetic)
Similar in spirit to Cucerzan and Yarowsky (99), but uses ALL char n-grams vs. just prefix/suffix
"#Tom#" → #Tom#, #Tom, Tom#, #To, Tom, om#, #T, To, om, m#, T, o, m
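To make the feature set concrete, here is a minimal sketch of extracting all character substrings of a boundary-marked word; the function name and the optional length cap are illustrative choices, not from the paper:

```python
def char_ngram_features(word, max_len=None):
    """Return all contiguous character substrings of '#word#' (including single
    characters and the '#' boundary markers), for use as binary features."""
    s = "#" + word + "#"
    feats = set()
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            if max_len is None or j - i <= max_len:
                feats.add(s[i:j])
    return feats

# e.g. char_ngram_features("Tom") contains '#Tom#', '#Tom', 'Tom#', '#To', 'Tom',
# 'om#', '#T', 'To', 'om', 'm#', 'T', 'o', 'm', plus the bare boundary marker '#'
```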
Character-Feature Based Classifier
Model I: independent classification at each word
Maxent classifiers, trained using conjugate gradient (a rough sketch follows the table below)
Equal-scale Gaussian priors for smoothing
Trained models with >800K features in ~2 hrs
POS tags and contextual features complement n-grams
Description         Added Features                              Overall F1 (English Dev)
Words               w0                                          52.29
Official Baseline   -                                           71.18
Char N-Grams        n(w0)                                       73.10
POS Tags            t0                                          74.17
Simple Context      w-1, w0, t-1, t1                            82.39
More Context        ‹w-1, w0›, ‹w0, w1›, ‹t-1, t0›, ‹t0, w1›    83.09
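A rough sketch of Model I built from off-the-shelf components: the paper trains its own maxent classifier with conjugate gradient, whereas this sketch substitutes scikit-learn's LogisticRegression, whose L2 penalty plays the role of the equal-scale Gaussian prior; the tiny training set is invented for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def char_ngram_feats(word):
    # all character substrings of the boundary-marked word, as binary features
    s = "#" + word + "#"
    return {s[i:j]: 1.0 for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

# tiny invented training set: isolated names and their entity classes
train = [("Staffordshire", "LOC"), ("Leicestershire", "LOC"),
         ("CentrInvest", "ORG"), ("Jones", "PER")]
X = [char_ngram_feats(w) for w, _ in train]
y = [label for _, label in train]

# DictVectorizer maps feature dicts to a sparse matrix; the L2 penalty of
# LogisticRegression corresponds to a Gaussian prior on the weights
clf = make_pipeline(DictVectorizer(), LogisticRegression(C=1.0, max_iter=1000))
clf.fit(X, y)
print(clf.predict([char_ngram_feats("Worcestershire")]))
```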
Character-Based CMM
Model II: Joint classifications along the sequence
Previous classification decisions are clearly relevant: "Grace Road" is a single location, not a person + location
Include neighboring classification decisions as features
Perform joint inference across the chain of classifiers
Conditional Markov Model (CMM, a.k.a. maxent Markov model): Borthwick 1999, McCallum et al 2000
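As a simplified illustration of the sequence model, the sketch below decodes left to right, feeding each previous decision back in as a feature; the greedy strategy and the `classify` helper are stand-ins of mine, not the paper's exact inference procedure.

```python
def decode_greedy(words, classify):
    """Left-to-right decoding where each local classifier sees the previous label.

    `classify(word, prev_label)` is assumed to return the most probable tag for
    `word` given the preceding decision (e.g. a maxent classifier as in Model I,
    extended with the previous-state feature s-1)."""
    labels = []
    prev = "O"                      # outside-any-entity start state
    for w in words:
        tag = classify(w, prev)     # local decision conditioned on the previous label
        labels.append(tag)
        prev = tag
    return labels

# With a classifier that has learned the "Grace Road" pattern, the previous-label
# feature lets the second word follow the first, so both come out LOC rather than PER + LOC.
```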
Character-Based CMM
Final extra features:
Letter-type patterns for each word, e.g. United → Xx, 12-month → d-x (see the sketch below)
Conjunction features, e.g. previous state and current signature
Repeated last words of multi-word names, e.g. Jones after having seen Doug Jones
… and a few more
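A small sketch of the letter-type signature feature; the exact collapsing rules are my guess at the behavior illustrated above (uppercase → X, lowercase → x, digit → d, other characters kept, repeated symbols collapsed):

```python
import re

def signature(word):
    """Collapse a word to a letter-type pattern, e.g. 'United' -> 'Xx', '12-month' -> 'd-x'."""
    mapped = []
    for ch in word:
        if ch.isupper():
            mapped.append("X")
        elif ch.islower():
            mapped.append("x")
        elif ch.isdigit():
            mapped.append("d")
        else:
            mapped.append(ch)
    # collapse runs of the same symbol: 'Xxxxxx' -> 'Xx', 'dd-xxxxx' -> 'd-x'
    return re.sub(r"(.)\1+", r"\1", "".join(mapped))

print(signature("United"))    # Xx
print(signature("12-month"))  # d-x
```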
Description        Added Features                              Overall F1 (English Dev)
More Context       ‹w-1, w0›, ‹w0, w1›, ‹t-1, t0›, ‹t0, w1›    83.09
Simple Sequence    s-1, ‹s-1, t-1, t0›                         85.44
More Sequence      ‹s-2, s-1›, ‹s-2, s-1, t-1, t0›             87.21
Final              misc. extra features                        92.27
Final Results
The drop from English dev to test is largely due to inconsistent labeling
Lack of capitalization cues in German hurts recall more, because the maxent classifier is precision-biased when faced with weak evidence
[Chart: precision, recall, and F1 by data set. F1: Eng Dev 92.27, Eng Test 86.31, Ger Dev 67.03, Ger Test 71.90]
Conclusions
Character substrings are valuable and underexploited model features
Named entities are internally quite descriptive
25-30% error reduction vs. word-level models
Discriminative maxent models allow productive feature engineering
30% error reduction vs. basic model
What distinguishes our approach?
More and better features
Regularization is crucial for preventing overfitting