extracting chemical- protein interactions using long short-term memory networks · 2018-11-26 ·...

Extracting Chemical-Protein Interactions using Long Short-Term Memory Networks

Sérgio Matos
aleixomatos@ua.pt
http://bioinformatics.ua.pt

BioCreative VI Workshop
Bethesda, 18-20 October 2017

Problem

• Detect relations between chemical compounds/drugs and genes/proteins in PubMed abstracts

• Gold-standard entities provided

• Five relation classes

Methods: Overview

• Multi-class classification

• Each possible chemical-gene pair in each sentence considered as an instance

• Labeled as one of five classes or negative

• Dependency features + linear sentence features
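The instance construction described above can be sketched as follows. This is a minimal illustration, not the author's code: the `make_instances` helper, the `"NONE"` label for negative pairs, and the entity identifiers are all hypothetical.

```python
from itertools import product

NEGATIVE = "NONE"  # hypothetical label for pairs with no annotated relation

def make_instances(sentence_entities, gold_relations):
    """Build one classification instance per chemical-gene pair in a sentence.

    sentence_entities: list of (entity_id, entity_type) tuples for one sentence.
    gold_relations: dict mapping (chemical_id, gene_id) -> relation class.
    """
    chemicals = [eid for eid, etype in sentence_entities if etype == "CHEMICAL"]
    genes = [eid for eid, etype in sentence_entities if etype == "GENE"]
    # Every chemical is paired with every gene; unannotated pairs are negatives.
    return [
        (chem, gene, gold_relations.get((chem, gene), NEGATIVE))
        for chem, gene in product(chemicals, genes)
    ]

entities = [("T1", "CHEMICAL"), ("T2", "GENE"), ("T3", "GENE")]
relations = {("T1", "T2"): "CPR:4"}
instances = make_instances(entities, relations)
```

Here one chemical and two genes yield two instances, one labeled with a relation class and one negative.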

Methods: Preprocessing

• Extract each chemical-protein pair

• Extract shortest path

– TEES (https://github.com/jbjorne/TEES) - BLLIP parser
– Datasets preprocessed using command line tool: XML output
– Create sentence graph and extract SP (NetworkX)
– Gold-standard entity annotations mapped to tokens
– Use head word for multi-word entity mentions
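The shortest-path step can be illustrated with a small sketch. The slides use NetworkX on the parser output; to stay dependency-free, this version uses a plain breadth-first search instead. The `shortest_path` helper is hypothetical, and the toy dependency edges are an assumption, reconstructed so that the path matches the example on the features slide.

```python
from collections import deque

def shortest_path(edges, source, target):
    """Breadth-first search over an undirected dependency graph.

    edges: list of (head, dependent, label) triples.
    Returns (words, labels) along the shortest path, or None if unreachable.
    Nodes are plain words here; real code should key nodes by token index,
    because the same word can occur more than once in a sentence.
    """
    adjacency = {}
    for head, dep, label in edges:
        adjacency.setdefault(head, []).append((dep, label))
        adjacency.setdefault(dep, []).append((head, label))

    previous = {source: None}  # node -> (parent, label of connecting edge)
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for neighbour, label in adjacency.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = (node, label)
                queue.append(neighbour)
    if target not in previous:
        return None

    # Walk back from the target to the source, then reverse.
    words, labels = [target], []
    node = target
    while previous[node] is not None:
        parent, label = previous[node]
        words.append(parent)
        labels.append(label)
        node = parent
    return words[::-1], labels[::-1]

# Toy graph (assumed) with both entities blinded to their types.
edges = [
    ("effects", "the", "det"),
    ("effects", "chemical", "prep_of"),
    ("compared", "effects", "nsubjpass"),
    ("compared", "were", "auxpass"),
    ("compared", "those", "prep_with"),
    ("those", "diclofenac", "prep_of"),
    ("diclofenac", "inhibitor", "appos"),
    ("inhibitor", "a", "det"),
    ("inhibitor", "nonselective", "amod"),
    ("inhibitor", "gene", "nn"),
]
words, labels = shortest_path(edges, "chemical", "gene")
```

The returned words and labels correspond to the word and dependency features of the pair.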

Methods: Features

1) words in shortest path – entities blinded
2) POS tags of words in shortest path
3) shortest path dependencies
4) up to 30 words before the first entity
5) the words between the entities
6) up to 30 words after the second entity

Methods: Features

1) chemical effects compared those diclofenac inhibitor gene
2) NN NNS VBN DT NN NN NN
3) prep_of nsubjpass prep_with prep_of appos nn
4) the effects of
5) were compared with those of diclofenac a nonselective
6) inhibitor
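The three linear sentence features (4 to 6) amount to slicing the token list around the two entity positions. The sketch below assumes whitespace tokenization and a hypothetical `sentence_features` helper; the sentence is reconstructed from the example values above, with the two entity mentions replaced by placeholder tokens.

```python
def sentence_features(tokens, first_idx, second_idx, window=30):
    """Split a tokenized sentence around two entity positions.

    Returns (left, middle, right): up to `window` tokens before the first
    entity, the tokens between the entities, and up to `window` tokens
    after the second entity. The entity tokens themselves are excluded.
    """
    if first_idx > second_idx:
        first_idx, second_idx = second_idx, first_idx
    left = tokens[max(0, first_idx - window):first_idx]
    middle = tokens[first_idx + 1:second_idx]
    right = tokens[second_idx + 1:second_idx + 1 + window]
    return left, middle, right

tokens = ("the effects of CHEMICAL were compared with those of "
          "diclofenac a nonselective GENE inhibitor").split()
left, middle, right = sentence_features(
    tokens, tokens.index("CHEMICAL"), tokens.index("GENE")
)
```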

            Training  Development  Test
Documents   1020      612          3334
Sentences   10309     6175         33854
Chemical    13017     8004         44066
Gene        12735     7563         41072

Instances                  Training  Development  Test
Total                      11953     7653         40887
upregulation/activation    595       432
downregulation/inhibition  1827      941
agonist                    140       104
antagonist                 206       182
substrate                  624       401

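Deep learning classifier

[Figure: network architecture]

• Shortest path words → Word embedding → Bi-LSTM → Dropout
• Shortest path POS tags → POS embedding → Bi-LSTM → Dropout
• Shortest path dependencies → Dependency embedding → Bi-LSTM → Dropout
• All branches → Fully connected → Output layer
• 64 units, dropout = 0.1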

Embeddings

• Word embeddings
– Word2vec (Gensim)
– 15 million MEDLINE abstracts – simple tokenization
– ~775k words
– 6 model parameters
  • window = 5/20/50
  • vector size = 100/300

• POS embeddings (size 200, random init)

• Dependency embeddings (300, random init)
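One plausible reading of "6 model parameters" is that the three window sizes were crossed with the two vector sizes, giving six word-embedding models. That reading is an assumption; the sketch below only enumerates the grid, and the dictionary keys are illustrative names rather than confirmed training settings.

```python
from itertools import product

windows = [5, 20, 50]      # context window sizes listed on the slide
vector_sizes = [100, 300]  # embedding dimensions listed on the slide

# Assumed: one word-embedding model per (window, vector size) combination.
configs = [
    {"window": w, "vector_size": s}
    for w, s in product(windows, vector_sizes)
]
```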

Run configurations

Columns: Configuration | Dependency features (Word, POS, Dep) | Sentence features (Left, Middle, Right) | Class weights

1 x x x x
2 x x x x x
3 x x x
4 x x x x
5 x x x x x x x
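The slides do not say how the class weights were computed. A common choice, shown here purely as an assumption, is the "balanced" heuristic, which weights each class inversely to its frequency. The sketch uses the training instance counts from the table above and assumes that the total includes the negative pairs; `balanced_class_weights` is a hypothetical helper.

```python
# Training instance counts per relation class, from the instances table.
train_counts = {
    "upregulation/activation": 595,
    "downregulation/inhibition": 1827,
    "agonist": 140,
    "antagonist": 206,
    "substrate": 624,
}
total_instances = 11953
# Assumed: every remaining instance is a negative (no-relation) pair.
train_counts["negative"] = total_instances - sum(train_counts.values())

def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }

weights = balanced_class_weights(train_counts)
```

Under this scheme the rarest class (agonist) gets the largest weight and the negative class the smallest, which pushes a classifier toward higher recall on the relation classes.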

Results

     Development                    Test
Run  Precision  Recall  F-Score    Precision  Recall  F-Score
1    0.6547     0.5403  0.5919     0.6419     0.2577  0.3677
2    0.4856     0.6221  0.5449     0.5156     0.4670  0.4901
3    0.6334     0.5126  0.5664     0.5919     0.2403  0.3418
4    0.4310     0.6092  0.5047     0.4024     0.4193  0.4107
5    0.4999     0.6074  0.5470     0.5738     0.4722  0.5181

Linear SVM: ~0.25 F-score using 1+2-grams of same features
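The F-scores reported for the five runs are consistent with the balanced F-score, the harmonic mean of precision and recall. As a quick check, the sketch below recomputes the test-set values from the reported precision and recall; small differences in the last digit are expected because the inputs are already rounded.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F-score) on the test set, copied from the slides.
test_results = {
    1: (0.6419, 0.2577, 0.3677),
    2: (0.5156, 0.4670, 0.4901),
    3: (0.5919, 0.2403, 0.3418),
    4: (0.4024, 0.4193, 0.4107),
    5: (0.5738, 0.4722, 0.5181),
}
recomputed = {
    run: round(f_score(p, r), 4)
    for run, (p, r, _) in test_results.items()
}
```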

Results

        CPR:0  CPR:3  CPR:4  CPR:5  CPR:6  CPR:9
CPR:0          292    433    46     77     234
CPR:3   147    258    30     0      1      1
CPR:4   271    19     661    1      4      5
CPR:5   40     1      0      70     0      0
CPR:6   24     0      2      2      122    0
CPR:9   158    3      5      0      0      223

Results

        CPR:0  CPR:3  CPR:4  CPR:5  CPR:6  CPR:9
CPR:0          292    433    46     77     234
CPR:3   147    64%    30     0      1      1
CPR:4   271    19     71%    1      4      5
CPR:5   40     1      0      64%    0      0
CPR:6   24     0      2      2      84%    0
CPR:9   158    3      5      0      0      59%

Precision: 59%-84%

Results

        CPR:0  CPR:3  CPR:4  CPR:5  CPR:6  CPR:9
CPR:0          53%    40%    40%    39%    51%
CPR:3   147    258    30     0      1      1
CPR:4   271    19     661    1      4      5
CPR:5   40     1      0      70     0      0
CPR:6   24     0      2      2      122    0
CPR:9   158    3      5      0      0      223

Recall: 47%-61%
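The precision and recall ranges on these slides can be reproduced from the confusion matrix counts. The sketch below rests on an inference from the numbers, not on anything stated in the slides: rows are read as predicted classes and columns as gold classes, and each class is scored against CPR:0 only, ignoring confusions between relation classes.

```python
CLASSES = ["CPR:3", "CPR:4", "CPR:5", "CPR:6", "CPR:9"]

# Counts copied from the confusion matrix on the slides.
correct = {"CPR:3": 258, "CPR:4": 661, "CPR:5": 70, "CPR:6": 122, "CPR:9": 223}
# CPR:0 column of each class row (read here as false positives).
cpr0_column = {"CPR:3": 147, "CPR:4": 271, "CPR:5": 40, "CPR:6": 24, "CPR:9": 158}
# CPR:0 row under each class column (read here as false negatives).
cpr0_row = {"CPR:3": 292, "CPR:4": 433, "CPR:5": 46, "CPR:6": 77, "CPR:9": 234}

def percent(numerator, denominator):
    """Ratio as a whole-number percentage."""
    return round(100 * numerator / denominator)

precision = {c: percent(correct[c], correct[c] + cpr0_column[c]) for c in CLASSES}
recall = {c: percent(correct[c], correct[c] + cpr0_row[c]) for c in CLASSES}
```

Under this reading, the percentages shown in the CPR:0 row of the recall slide are the complementary miss rates (for example, 292 of 550 for CPR:3 is 53%, leaving 47% recall).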

Conclusions / Future

• Error analysis

• Different kernels

• Network topology

• Hyper-parameters

Thank you!

