usami bionlp2011
Post on 25-May-2015
Automatic Acquisition of Huge Training Data for Bio-Medical Named Entity Recognition
Yu Usami, Han-Cheol Cho, Naoaki Okazaki, and Jun’ichi Tsujii
Graduate School of Information Science and Technology University of Tokyo
Introduction
Named Entity Recognition
AM, cystatin C and cathepsin B are present as ...
Recent approach:
Machine learning on manually annotated corpus
• BioCreAtIvE task 1A (Yeh et al., 2005)
• Semi-supervised (Vlachos and Gasperin, 2010)
Tokens: AM  ,  cystatin  C  and  cathepsin  B  are  present  as
Labels: B   O  B         I  O    B          I  O    O        O
B: Beginning of NE
I: Inside of NE
O: Outside of NE
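The B/I/O scheme can be illustrated with a minimal sketch that reads entity strings back out of a label sequence. The token-to-label alignment below is inferred from the label row on the slide, and the function name `bio_to_entities` is hypothetical.

```python
def bio_to_entities(tokens, labels):
    """Collect named-entity strings from parallel token and B/I/O label lists."""
    entities, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":                 # a new entity starts here
            if current:
                entities.append(" ".join(current))
            current = [token]
        elif label == "I" and current:   # continue the entity that is open
            current.append(token)
        else:                            # "O" (or a stray "I") closes any open entity
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


tokens = ["AM", ",", "cystatin", "C", "and", "cathepsin", "B", "are", "present", "as"]
labels = ["B", "O", "B", "I", "O", "B", "I", "O", "O", "O"]
print(bio_to_entities(tokens, labels))
```

With these labels the three mentions recovered are "AM", "cystatin C" and "cathepsin B".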
Expensive
• Cost
• Time
Our Idea
Utilize inexpensive and large resources:
Lexical database Unlabeled text
Build dictionary
String match
Acquire annotated corpus for Training
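A minimal sketch of the string-match step, assuming whitespace-tokenized text and a dictionary stored as a set of token tuples. The greedy longest-match strategy, the function name `annotate` and the `max_len` parameter are assumptions made for illustration, not details given on the slides.

```python
def annotate(tokens, dictionary, max_len=5):
    """Label tokens with B/I/O by greedy longest match against dictionary entries.

    `dictionary` is a set of token tuples; `max_len` caps the entry length tried.
    """
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate first so a two-token name beats a one-token one.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in dictionary:
                matched = n
                break
        if matched:
            labels[i] = "B"
            for j in range(i + 1, i + matched):
                labels[j] = "I"
            i += matched
        else:
            i += 1
    return labels


dictionary = {("AM",), ("cystatin", "C"), ("cathepsin", "B")}
tokens = "AM , cystatin C and cathepsin B are present as".split()
print(list(zip(tokens, annotate(tokens, dictionary))))
```

The same call would also label "AM" in a sentence about times of day, since plain string matching has no notion of context.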
Dictionary Building
Symbol: CD177
Official Name: CD177 molecule
Synonyms: NB1, PRV1, HNA2A, CD177
Dictionary entries: CD177, CD177 molecule, NB1, PRV1, HNA2A
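The CD177 example can be sketched as follows. The record layout (keys `symbol`, `official_name`, `synonyms`) is a hypothetical stand-in for whatever form the Entrez Gene records take after parsing.

```python
def build_dictionary(records):
    """Collect every distinct name (symbol, official name, synonyms) from gene records."""
    names, seen = [], set()
    for record in records:
        candidates = [record["symbol"], record["official_name"], *record["synonyms"]]
        for name in candidates:
            if name and name not in seen:   # skip blanks and duplicates, keep first-seen order
                seen.add(name)
                names.append(name)
    return names


records = [{
    "symbol": "CD177",
    "official_name": "CD177 molecule",
    "synonyms": ["NB1", "PRV1", "HNA2A", "CD177"],
}]
print(build_dictionary(records))
```

The duplicate "CD177" in the synonym list is dropped, which leaves the five entries shown on the slide.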
Task Settings
Task: Single class NER
Target Class: Gene-or-gene-product (GGP)
Resources:
• Lexical database: Entrez Gene (includes 6,816,109 gene (protein) records)
• Unlabeled text: 2009 MEDLINE (includes 17,764,827 articles)
Ineffectiveness of Simple Approach
Dictionary-based NER
• String match against test data
ML-based NER trained on acquired training data
• String match on unlabeled text to obtain training data
• Learn a model
• Apply the model to test data
            P      R      F1
Dic-based   39.03  42.69  40.78
ML-based    10.18  23.83  14.27
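The F1 values follow from precision and recall as their harmonic mean. A short check, using only the numbers on the slide (the function name `f1_score` is illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(39.03, 42.69), 2))  # dictionary-based NER
print(round(f1_score(10.18, 23.83), 2))  # ML-based NER
```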
Problem of Simple Approach
Stats: Acquired 1,715,344,107 labeled tokens including 10.0% NEs
Examples
(A) PMID 1984484: It is clear that in culture media of AM, cystatin C and cathepsin B are present as proteinase-antiproteinase complexes.
(B) PMID 23456: Temperature in puerperium is higher in AM, lower in PM.
Goal of This Study
Our Contribution
Acquire huge high-quality training data with lexical database and unlabeled text
Methodology
1. Utilize references (links) for disambiguation
2. Expand NEs based on coordination analysis
3. Gain new NEs by using self-training
Disambiguation
Utilize lexical database references
record: AM
reference: PMID 1984484
(A) PMID 1984484: It is clear that in culture media of AM, cystatin C and cathepsin B are present as proteinase-antiproteinase complexes.
(B) PMID 23456: Temperature in puerperium is higher in AM, lower in PM.
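The reference check can be sketched as a simple filter: a dictionary match is kept only when the matched record refers to the article in which the match occurs. The `references` mapping and the function name `keep_match` are hypothetical; only the AM record and the two PMIDs come from the slide.

```python
def keep_match(name, pmid, references):
    """Accept a dictionary match only if the record for `name` cites this article.

    `references` maps a dictionary name to the set of PMIDs its record refers to.
    """
    return pmid in references.get(name, set())


references = {"AM": {1984484}}

print(keep_match("AM", 1984484, references))  # example (A): the record cites this article
print(keep_match("AM", 23456, references))    # example (B): no reference, match is dropped
```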
Side Effect of Using References
The lexical database lacks some references
record    entA   entB  entC
ref PMID  19025  1021  4928016
PMID 4928016: ... three genes concerned (designated entA, entB and entC)
String match if referred
Coordination Analysis
Expand NEs based on coordination structure
record entA entB entC
ref PMID 4928016
PMID 4928016: ... three genes concerned (designated entA, entB and entC)
1. Start from here
2. Coordinate token
3. Is this mention included in the dictionary? Yes
4. Coordinate token
5. Is this mention included in the dictionary? Yes
6. Not a coordinate token / Not included
7. End
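The walk above can be sketched as a loop that alternates between a coordinate-token check and a dictionary check. This is a sketch under several assumptions: mentions are single tokens, the coordinate tokens are ",", "and" and "or", the walk runs in both directions from the seed, and the seed is entC (the mention assumed here to be matched through its reference).

```python
COORDINATORS = {",", "and", "or"}   # assumed set of coordinate tokens


def expand_by_coordination(tokens, seeds, dictionary):
    """Grow the set of NE token indices along coordination structures.

    Starting from each seed index, step over one coordinate token and accept
    the next token if it is a dictionary entry; repeat until either check fails.
    Only single-token mentions are handled in this sketch.
    """
    found = set(seeds)
    for seed in seeds:
        for step in (-1, 1):                      # walk left, then right
            pos = seed
            while True:
                coord, mention = pos + step, pos + 2 * step
                if not (0 <= mention < len(tokens)):
                    break
                if tokens[coord] not in COORDINATORS:
                    break                         # not a coordinate token: end
                if tokens[mention] not in dictionary:
                    break                         # not included in the dictionary: end
                found.add(mention)
                pos = mention
    return sorted(found)


tokens = "three genes concerned ( designated entA , entB and entC )".split()
dictionary = {"entA", "entB", "entC"}
seed = tokens.index("entC")
print([tokens[i] for i in expand_by_coordination(tokens, [seed], dictionary)])
```

Starting from entC, the walk picks up entB and entA and stops at "designated", which is not a coordinate token.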
Self-training
1. Learning: train a classifier model on the training data
2. Apply: run the model on the remaining data
3. Add new NEs to the training data
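A generic self-training loop might look like the sketch below. The `train` and `predict` callables are supplied by the caller, and the toy versions shown here (which remember NE names and the tokens that precede them) are purely illustrative stand-ins for the actual classifier.

```python
def self_train(labeled, unlabeled, train, predict, rounds=3):
    """Repeatedly train, label the remaining data, and move newly labeled items over.

    `train(labeled) -> model` and `predict(model, item) -> label or None` are
    supplied by the caller; None means the model makes no confident prediction.
    """
    labeled, remaining = list(labeled), list(unlabeled)
    model = train(labeled)
    for _ in range(rounds):
        newly, still = [], []
        for item in remaining:
            label = predict(model, item)
            if label is None:
                still.append(item)
            else:
                newly.append((item, label))
        if not newly:
            break                      # nothing new was added, so stop early
        labeled += newly
        remaining = still
        model = train(labeled)
    return model, labeled, remaining


# Toy stand-ins: an item is a (previous token, token) pair.
def train(labeled):
    return {
        "contexts": {prev for (prev, _tok), lab in labeled if lab == "NE"},
        "names": {tok for (_prev, tok), lab in labeled if lab == "NE"},
    }


def predict(model, item):
    prev, tok = item
    return "NE" if prev in model["contexts"] or tok in model["names"] else None


seed = [(("gene", "entA"), "NE")]
pool = [("gene", "entB"), ("protein", "entB"), ("protein", "entC"), ("the", "cell")]
model, labeled, remaining = self_train(seed, pool, train, predict)
print(len(labeled), remaining)
```

Each round adds items the previous model could label, and the retrained model then reaches items it could not label before.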
Evaluation Settings
Test corpus: BioNLP 2011 Shared Task EPI corpus (Training set + Development set)
Learning and Decoding: Linear kernel SVM (Predict each token label sequentially)
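Sequential prediction can be sketched as a left-to-right pass in which each decision sees the previously predicted label. The `score` callable stands in for a trained classifier such as the SVM. The toy scorer and the rule that "I" may not follow "O" are assumptions made for this example.

```python
def decode(tokens, score):
    """Label tokens left to right; each decision sees the previously predicted label.

    `score(token, prev_label, label) -> float` stands in for a trained classifier.
    """
    labels, prev = [], "O"
    for token in tokens:
        # Assumed constraint: "I" is only allowed right after "B" or "I".
        candidates = ["B", "O"] if prev == "O" else ["B", "I", "O"]
        best = max(candidates, key=lambda lab: score(token, prev, lab))
        labels.append(best)
        prev = best
    return labels


def toy_score(token, prev_label, label):
    """Hand-written scores for illustration only."""
    if label == "B":
        return 1.0 if token in {"cystatin", "cathepsin"} else 0.0
    if label == "I":
        return 1.0 if len(token) == 1 and token.isupper() else 0.0
    return 0.5


print(decode("cystatin C and cathepsin B are".split(), toy_score))
```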
NER Results
Method              Prec.  Recall  F1
Dic-based
  String match      39.03  42.69   40.78
  + References      90.62  13.52   23.53
  + Coord Analysis  89.66  13.77   23.87
ML-based
  String match      10.18  23.83   14.27
  + References      69.25  39.12   50.00
  + Coord Analysis  66.79  47.44   55.47
  + Self-training   63.72  51.18   56.77
Automatic vs Manual
Type       Total tokens  NE tokens
Manual     161,577       12,603
Automatic  48,677,426    3,055,362
NER Performance (trained on each corpus)
[Bar chart of P, R and F1. F1: Manual 67.89, Automatic 62.66. Remaining bar values: 57.92, 58.56, 68.26, 80.76]
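The corpus statistics can be put side by side with a little arithmetic. The token counts are taken from the table; the derived percentages and size ratio are computed here and do not appear on the slide.

```python
corpora = {
    "Manual": (161_577, 12_603),          # (total tokens, NE tokens)
    "Automatic": (48_677_426, 3_055_362),
}

for name, (total, ne) in corpora.items():
    print(f"{name}: {100 * ne / total:.2f}% of tokens are NE tokens")

ratio = corpora["Automatic"][0] / corpora["Manual"][0]
print(f"The automatic corpus has about {ratio:.0f} times as many tokens")
```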
Conclusion
Acquired high-quality training data automatically
• Use of references for high precision
• Improve recall with
‣ Coordination analysis
‣ Self-training
Acquired large-size training data
• Used 10% (memory limitation)
Future Work
Utilize all of acquired training data for learning
‣ Online learning
Improve self-training performance
Semi-supervised approach with acquired data
Apply to another domain or semantic class