usami bionlp2011

Automatic Acquisition of Huge Training Data for Bio-Medical Named Entity Recognition

Yu Usami, Han-Cheol Cho, Naoaki Okazaki, and Jun’ichi Tsujii

Graduate School of Information Science and Technology, University of Tokyo

Introduction

Named Entity Recognition

AM, cystatin C and cathepsin B are present as ...

Recent approach:

Machine learning on manually annotated corpus

• BioCreAtIvE task 1A (Yeh et al., 2005)

• Semi-supervised (Vlachos and Gasperin, 2010)

AM/B  ,/O  cystatin/B  C/I  and/O  cathepsin/B  B/I  are/O  present/O  as/O

Labels:

• B: Beginning of NE

• I: Inside of NE

• O: Out of NE
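The B/I/O scheme can be illustrated with a minimal Python sketch. The function name `spans_to_bio` and the token-index spans are assumptions made for illustration; they do not come from the slides.

```python
from typing import List, Tuple


def spans_to_bio(n_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Convert entity spans (start, end-exclusive token indices) to B/I/O labels."""
    labels = ["O"] * n_tokens
    for start, end in spans:
        labels[start] = "B"              # first token of the entity
        for i in range(start + 1, end):
            labels[i] = "I"              # remaining tokens of the entity
    return labels


tokens = ["AM", ",", "cystatin", "C", "and", "cathepsin", "B", "are", "present", "as"]
# Hypothetical spans for "AM", "cystatin C" and "cathepsin B".
spans = [(0, 1), (2, 4), (5, 7)]
labels = spans_to_bio(len(tokens), spans)

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Each token receives exactly one label, so a sequence labeler only has to predict one of three classes per token.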

Manually annotated corpora are expensive:

• Cost

• Time

Our Idea

Utilize inexpensive and large resources:

• Lexical database

• Unlabeled text

1. Build dictionary

2. String match

3. Acquire annotated corpus for training
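The string-match step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes whitespace tokenization, case-sensitive matching and a greedy longest-match-first policy, and the name `string_match_bio` is hypothetical.

```python
from typing import List, Set, Tuple


def string_match_bio(tokens: List[str], dictionary: Set[Tuple[str, ...]]) -> List[str]:
    """Label tokens with B/I/O by greedy longest-first dictionary matching."""
    max_len = max((len(entry) for entry in dictionary), default=0)
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate first so multi-token names win over shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in dictionary:
                matched = n
                break
        if matched:
            labels[i] = "B"
            for j in range(i + 1, i + matched):
                labels[j] = "I"
            i += matched
        else:
            i += 1
    return labels


# Dictionary entries are stored as token tuples (an assumption of this sketch).
dictionary = {("AM",), ("cystatin", "C"), ("cathepsin", "B")}
sentence = "AM , cystatin C and cathepsin B are present as proteinase-antiproteinase complexes ."
tokens = sentence.split()
labels = string_match_bio(tokens, dictionary)
print(list(zip(tokens, labels)))
```

Running the matcher over every sentence of the unlabeled text yields token-level labels that can serve directly as training data for a sequence labeler.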

Dictionary Building

Lexical database record (example):

• Symbol: CD177

• Official Name: CD177 molecule

• Synonyms: NB1, PRV1, HNA2A, CD177

Dictionary entries: CD177, CD177 molecule, NB1, PRV1, HNA2A
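A minimal sketch of how one record could be turned into dictionary entries. The field names `symbol`, `official_name` and `synonyms` are hypothetical and do not reflect the actual Entrez Gene file layout.

```python
from typing import Dict, List, Set


def build_dictionary(records: List[Dict]) -> Set[str]:
    """Collect every name of every record into one set, dropping duplicates."""
    names: Set[str] = set()
    for record in records:
        names.add(record["symbol"])
        names.add(record["official_name"])
        names.update(record["synonyms"])
    return names


# The CD177 example, with hypothetical field names.
records = [
    {
        "symbol": "CD177",
        "official_name": "CD177 molecule",
        "synonyms": ["NB1", "PRV1", "HNA2A", "CD177"],
    }
]
dictionary = build_dictionary(records)
print(sorted(dictionary))
```

Using a set removes the repeated `CD177`, which appears both as the symbol and among the synonyms.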

Task Settings

Task: Single class NER

Target Class: Gene-or-gene-product (GGP)

Resources:

• Lexical database: Entrez Gene (includes 6,816,109 gene (protein) records)

• Unlabeled text: 2009 MEDLINE (includes 17,764,827 articles)

Ineffectiveness of Simple Approach

• Dictionary-based NER: String match → Test data

• ML-based NER trained on acquired training data: String match on Unlabeled text → Training data → Test data

[Figure: comparison of Dic-based and ML-based NER]

Problem of Simple Approach

Stats: Acquired 1,715,344,107 labeled tokens, including 10.0% NEs

Examples:

(A) PMID 1984484: It is clear that in culture media of AM, cystatin C and cathepsin B are present as proteinase-antiproteinase complexes.

(B) PMID 23456: Temperature in puerperium is higher in AM, lower in PM.

Goal of This Study

Our Contribution: Acquire huge, high-quality training data with a lexical database and unlabeled text

Methodology

1. Utilize references (links) for disambiguation

2. Expand NEs based on coordination analysis

3. Gain new NEs by using self-training

Disambiguation

Utilize lexical database references

record: AM

reference: PMID 1984484

(A) PMID 1984484: It is clear that in culture media of AM, cystatin C and cathepsin B are present as proteinase-antiproteinase complexes.
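Reference-based disambiguation can be sketched as a filter on the string match: a name is labeled in an article only if its record cites that article. The sketch below assumes single-token names and a hypothetical `references` mapping from name to the set of PMIDs that its record refers to.

```python
from typing import Dict, List, Set


def names_for_article(pmid: str, references: Dict[str, Set[str]]) -> Set[str]:
    """Return the names whose lexical database record refers to this article."""
    return {name for name, pmids in references.items() if pmid in pmids}


def match_with_references(
    pmid: str, tokens: List[str], references: Dict[str, Set[str]]
) -> List[str]:
    """String match restricted to names that reference the article (single-token names only)."""
    allowed = names_for_article(pmid, references)
    return ["B" if token in allowed else "O" for token in tokens]


# Hypothetical reference table: the record "AM" refers to PMID 1984484.
references = {"AM": {"1984484"}}

tokens_a = "in culture media of AM , cystatin C and cathepsin B".split()
tokens_b = "Temperature in puerperium is higher in AM , lower in PM .".split()

labels_a = match_with_references("1984484", tokens_a, references)
labels_b = match_with_references("23456", tokens_b, references)
print(labels_a)
print(labels_b)
```

In this sketch "AM" is labeled in article (A), which the record refers to, and left unlabeled in article (B), where it denotes a time of day.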

Side Effect of Using References

Missing references in the lexical database

record   ref PMID
entA     19025
entB     1021
entC     4928016

PMID 4928016: ... three genes concerned (designated entA, entB and entC)

String match if referred

Expand NEs based on coordination structure

Coordination Analysis

ref PMID 4928016

1. Start from here

2. Coordinate token

3. Is this mention included in the dictionary?

4. Coordinate token

5. Is this mention included in the dictionary?

6. Not a coordinate token / Not included
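One plausible reading of this walk-through can be sketched in Python. The set of coordinate tokens, the restriction to single-token mentions and the name `expand_by_coordination` are all assumptions of this sketch, not details confirmed by the slides.

```python
from typing import List, Set

# Assumed set of coordinate tokens.
COORDINATE_TOKENS = {",", "and", "or"}


def expand_by_coordination(tokens: List[str], seeds: Set[int], dictionary: Set[str]) -> Set[int]:
    """Grow a set of NE token indices along coordination chains.

    From each known mention, look one step left and right: if the neighbor is a
    coordinate token and the token beyond it is in the dictionary, accept it as
    a new mention and continue from there.
    """
    found = set(seeds)
    frontier = list(seeds)
    while frontier:
        i = frontier.pop()
        for step in (-1, 1):
            link, cand = i + step, i + 2 * step
            if not 0 <= cand < len(tokens):
                continue
            if tokens[link] not in COORDINATE_TOKENS:
                continue  # not a coordinate token: stop in this direction
            if tokens[cand] in dictionary and cand not in found:
                found.add(cand)
                frontier.append(cand)
    return found


tokens = "three genes concerned ( designated entA , entB and entC )".split()
dictionary = {"entA", "entB", "entC"}
seeds = {tokens.index("entC")}  # only entC was matched through its reference
mentions = expand_by_coordination(tokens, seeds, dictionary)
print(sorted(tokens[i] for i in mentions))
```

Starting from the referred mention `entC`, the walk reaches `entB` through "and" and `entA` through ",", then stops at "designated", which is not a coordinate token.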

Self-training

1. Learning: Training Data → Classifier Model

2. Apply the Classifier Model to the Remaining Data

3. Add new NEs to the Training Data
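The self-training loop can be sketched generically. The toy shape-based classifier below is only a stand-in chosen to keep the example self-contained; the confidence threshold, the number of rounds and all function names are assumptions of this sketch.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (token, label)


def shape(token: str) -> str:
    """Toy feature: coarse character shape, e.g. 'entA' -> 'aaaA'."""
    out = []
    for c in token:
        if c.isupper():
            out.append("A")
        elif c.islower():
            out.append("a")
        elif c.isdigit():
            out.append("9")
        else:
            out.append(c)
    return "".join(out)


def train(examples: List[Example]) -> Dict[str, Counter]:
    """Count how often each shape carries each label."""
    model: Dict[str, Counter] = defaultdict(Counter)
    for token, label in examples:
        model[shape(token)][label] += 1
    return model


def predict(model: Dict[str, Counter], token: str) -> Tuple[str, float]:
    """Return the majority label for the token's shape and its relative frequency."""
    counts = model.get(shape(token))
    if not counts:
        return "O", 0.0
    label, n = counts.most_common(1)[0]
    return label, n / sum(counts.values())


def self_train(labeled: List[Example], unlabeled: List[str], rounds: int = 3, threshold: float = 0.9):
    """Repeatedly train, label the remaining data, and keep confident predictions."""
    labeled, remaining = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        confident, rest = [], []
        for token in remaining:
            label, conf = predict(model, token)
            (confident if conf >= threshold else rest).append((token, label))
        if not confident:
            break
        labeled.extend(confident)
        remaining = [token for token, _ in rest]
    return train(labeled), labeled


seed = [("entA", "B"), ("entB", "B"), ("genes", "O"), ("three", "O")]
model, final_labeled = self_train(seed, ["entC", "designated", "CD177"])
print(final_labeled)
```

In this toy run `entC` shares its shape with the seed entities and is added as a new NE, while tokens with unseen shapes stay in the remaining data.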

Evaluation Settings

Test corpus: BioNLP 2011 Shared Task EPI corpus (Training set + Development set)

Learning and Decoding: Linear kernel SVM (Predict each token label sequentially)

NER Results

Method               Prec.   Recall  F1
Dic-based
  String match       39.03   42.69   40.78
  + References       90.62   13.52   23.53
  + Coord Analysis   89.66   13.77   23.87
ML-based
  String match       10.18   23.83   14.27
  + References       69.25   39.12   50.00
  + Coord Analysis   66.79   47.44   55.47
  + Self-training    63.72   51.18   56.77
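F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes two of the reported scores from their precision and recall; because the inputs are already rounded to two decimals, a recomputed value can differ from a reported one in the last digit.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


dic_string_match = f1(39.03, 42.69)   # Dic-based, String match
ml_self_training = f1(63.72, 51.18)   # ML-based, + Self-training
print(round(dic_string_match, 2), round(ml_self_training, 2))
```

The harmonic mean penalizes imbalance, which is why a precision of 90.62 paired with a recall of 13.52 still yields an F1 of only 23.53.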

Automatic vs Manual

Type        Total tokens   NE tokens
Manual      161,577        12,603
Automatic   48,677,426     3,055,362

NER Performance (trained on each corpus: Manual, Automatic)

[Figure: bar chart of P, R and F1 for each corpus. Bar values: 62.66, 67.89, 57.92, 58.56, 68.26, 80.76. Callouts: F1: 67.89, F1: 62.66]

Conclusion

Acquired high-quality training data automatically

• Use of references for high precision

• Improve recall with

‣ Coordination analysis

‣ Self-training

Acquired a large amount of training data

• Used 10% (memory limitation)

Future Work

Utilize all of the acquired training data for learning

‣ Online learning

Improve self-training performance

Semi-supervised approach with acquired data

Apply to another domain or semantic class
