Using Weakly Labeled Data to Learn Models for Extracting Information from Biomedical Text Mark Craven Department of Biostatistics & Medical Informatics Department of Computer Sciences University of Wisconsin U.S.A. [email protected] www.biostat.wisc.edu/~craven



Using Weakly Labeled Data to Learn Models for Extracting Information from Biomedical Text

Mark Craven
Department of Biostatistics & Medical Informatics
Department of Computer Sciences
University of Wisconsin

[email protected]

www.biostat.wisc.edu/~craven


The Information Extraction Task

Analysis of Yeast PRP20 Mutations and Functional Complementation by the Human Homologue RCC1, a Protein Involved in the Control of Chromosome Condensation

Fleischmann M, Clark M, Forrester W, Wickens M, Nishimoto T, Aebi M

Mutations in the PRP20 gene of yeast show a pleiotropic phenotype, in which both mRNA metabolism and nuclear structure are affected. SRM1 mutants, defective in the same gene, influence the signal transduction pathway for the pheromone response . . . By immunofluorescence microscopy the PRP20 protein was localized in the nucleus. Expression of the RCC1 protein can complement the temperature-sensitive phenotype of PRP20 mutants, demonstrating the functional similarity of the yeast and mammalian proteins

protein(PRP20)
subcellular-localization(PRP20, nucleus)


Motivation

• assisting in the construction and updating of databases

• providing structured summaries for queries

What is known about protein X (subcellular & tissue localization, associations with diseases, interactions with drugs, …)?

• assisting scientific discovery by detecting previously unknown relationships, annotating experimental data


Three Themes in Our IE Research

1. Using “weakly” labeled training data

2. Representing sentence structure in learned models

3. Combining evidence when making predictions


1. Using “Weakly” Labeled Data

• why use machine learning methods in building information-extraction systems?
– hand-coding IE systems is expensive, time-consuming
– there is a lot of data that can be leveraged

• where do we get a training set?
– by having someone hand-label data (expensive)
– by coupling tuples in an existing database with relevant documents (cheap)


“Weakly” Labeled Training Data

• to get positive examples, match DB tuples to passages of text referencing constants in tuples

YPD database tuples:
P1, L1
P2, L2
P3, L3

MEDLINE abstracts:
…P1…L1…
…P2…L2…
…P1…L1…
…L3…P3…
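The matching step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: constants are matched by case-insensitive substring search, and the function name `weak_label` and the tuple list are hypothetical, not taken from the talk.

```python
def weak_label(tuples, sentences):
    """Pair each sentence with the database tuples whose constants it mentions."""
    labeled = []
    for sentence in sentences:
        lowered = sentence.lower()
        # A tuple matches when both of its constants occur in the sentence.
        matches = [
            (protein, location)
            for protein, location in tuples
            if protein.lower() in lowered and location.lower() in lowered
        ]
        labeled.append((sentence, matches))
    return labeled


# Hypothetical tuples; the first two sentences are shortened from the slides.
tuples = [("VAC8p", "vacuole"), ("PRP20", "nucleus")]
sentences = [
    "VAC8p is a 64-kD protein found on the vacuole membrane.",
    "In analogy, VAC8p may link the vacuole to actin during vacuole partitioning.",
    "Mutations in the PRP20 gene of yeast show a pleiotropic phenotype.",
]
labeled = weak_label(tuples, sentences)
for sentence, matches in labeled:
    print("positive" if matches else "negative", matches)
```

Sentences with at least one matching tuple become weakly labeled positives; the rest can serve as negatives.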


Weakly Labeled Training Data

• consider the sentences associated with the relation subcellular-localization(VAC8p, vacuole) after weak labeling:

In addition to its role in early vacuole inheritance, VAC8p is required to target aminopeptidase I from the cytoplasm to the vacuole.

In analogy, VAC8p may link the vacuole to actin during vacuole partitioning.

VAC8p is a 64-kD protein found on the vacuole membrane, a site consistent with its role in vacuole inheritance.

• the labeling is weak in that many sentences with co-occurrences wouldn’t be considered positive examples if we were hand-labeling them


Learning Context Patterns for Recognizing Protein Names

Selections from the training corpus:

…gene encoding <p>gamma-glutamyl kinase</p> was…
…recognized genes encoding <p>vimentin</p>, heat…
…found that <p>E2F</p> binds specifically…
…<p>IleRS</p> binds to the acceptor…
…of <p>CPB II</p> binds 1 mol of…
…purified C/<p>EBP</p> binds at the same position…
…which interacts with <p>CD4</p>: both…
…14-3-3tau interacts with <p>protein kinase C mu</p>, a subtype…

Induced patterns, with the counts shown on the slide:

encoding [X]  2/4
[X] binds  4/5
interacts with [X]  2/6

• We use AutoSlog [Riloff ’96] to find “triggers” that commonly occur before and after tagged proteins in a training corpus
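A much-simplified stand-in for this idea is sketched below. It is not AutoSlog: it only counts single-word triggers immediately before or after a `<p>…</p>` span, and it assumes (the slides do not say) that a fraction such as 4/5 means "slot filled by a tagged protein" over "all occurrences of the trigger". The name `pattern_counts` and the last corpus sentence are hypothetical.

```python
import re
from collections import Counter

TAGGED = re.compile(r"<p>.*?</p>")


def pattern_counts(sentences):
    """Return {pattern: (hits, total)} for 'word [X]' and '[X] word' patterns."""
    hits, totals = Counter(), Counter()
    for sentence in sentences:
        # Collapse each tagged protein name into one placeholder token.
        tokens = TAGGED.sub(" PROTEIN_SLOT ", sentence).split()
        for i, tok in enumerate(tokens):
            if tok == "PROTEIN_SLOT":
                continue
            word = tok.lower().strip(".,:;")
            if not word:
                continue
            if i + 1 < len(tokens):  # trigger followed by a slot
                totals[f"{word} [X]"] += 1
                hits[f"{word} [X]"] += int(tokens[i + 1] == "PROTEIN_SLOT")
            if i > 0:  # trigger preceded by a slot
                totals[f"[X] {word}"] += 1
                hits[f"[X] {word}"] += int(tokens[i - 1] == "PROTEIN_SLOT")
    return {p: (hits[p], totals[p]) for p in totals}


corpus = [
    "gene encoding <p>gamma-glutamyl kinase</p> was",
    "found that <p>E2F</p> binds specifically",
    "<p>IleRS</p> binds to the acceptor",
    "the ligand binds weakly",  # hypothetical sentence with no tagged protein
]
counts = pattern_counts(corpus)
print(counts["[X] binds"], counts["encoding [X]"])
```

Patterns with a high hit ratio are the ones worth keeping as triggers.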


“Weak” Labeling Example

PubMed abstract:

Two distinct forms of oxidases catalysing the oxidative deamidation of D-alpha-amino acids have been identified in human tissues: <p>D-amino acid oxidase</p> and <p>D-aspartate oxidase</p>. The enzymes differ in their electrophoretic properties, tissue distribution, binding with flavine adenine dinucleotide, heat stability, molecular size and possibly in subunit structure. Neither enzyme exhibits genetic polymorphism in European populations, but a rare electrophoretic variant phenotype (<p>DASOX</p> 2-1) was identified which suggests that the <p>DASOX</p> locus is autosomal and independent of the <p>DAMOX</p> locus.

SwissProt dictionary:

...
D-AKAP-2
D-amino acid oxidase
D-aspartate oxidase
D-dopachrome tautomerase
…
DAG kinase zeta
DAMOX
DASOX
DAT
DB83 protein
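Dictionary-based weak labeling of this kind can be sketched as below. The matching rules (exact, case-sensitive, longest name first, no match inside a longer word) are assumptions made for illustration; the function name `tag_with_dictionary` is hypothetical.

```python
import re


def tag_with_dictionary(text, names):
    """Wrap every dictionary name found in the text in <p>...</p> tags."""
    # Longest names first, so multi-word names win over their substrings.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w-])(" + "|".join(re.escape(n) for n in ordered) + r")(?![\w-])"
    )
    return pattern.sub(r"<p>\1</p>", text)


dictionary = [
    "D-AKAP-2", "D-amino acid oxidase", "D-aspartate oxidase",
    "DAG kinase zeta", "DAMOX", "DASOX", "DAT", "DB83 protein",
]
text = (
    "tissues: D-amino acid oxidase and D-aspartate oxidase. "
    "The DASOX locus is independent of the DAMOX locus."
)
tagged = tag_with_dictionary(text, dictionary)
print(tagged)
```

Because the tags come from string matching rather than from an annotator, some of them will be wrong or missing, which is what makes the labels weak.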


Protein Name Extraction Approach

Input text: Two distinct forms of oxidases catalysing the oxidative deamidation of D-alpha-amino acids have been identified in human tissues: D-amino acid oxidase and …

Autoslog patterns:
encoding [X]
[X] binds
interacts with [X]

Pipeline:
1. select noun phrases that match Autoslog patterns
2. classify noun phrases using a naïve Bayes model
3. extract positive classifications

Output: D-amino acid oxidase
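The classification step might look roughly like the following Bernoulli naïve Bayes sketch. The four surface features, the training phrases and their labels, and all function names are hypothetical choices for illustration; the talk does not list the actual features.

```python
import math


def features(phrase):
    """Four hypothetical binary surface features of a candidate noun phrase."""
    return (
        any(c.isdigit() for c in phrase),        # contains a digit
        any(c.isupper() for c in phrase),        # contains an uppercase letter
        phrase.lower().endswith(("ase", "in")),  # protein-like suffix
        "-" in phrase,                           # contains a hyphen
    )


def train(examples):
    """examples: list of (phrase, is_protein). Returns per-class prior and feature rates."""
    model = {}
    for label in (True, False):
        rows = [features(p) for p, y in examples if y == label]
        n = len(rows)
        # Laplace-smoothed estimate of Pr(feature k is on | label).
        rates = [(sum(r[k] for r in rows) + 1) / (n + 2) for k in range(4)]
        model[label] = (n / len(examples), rates)
    return model


def log_score(model, phrase, label):
    prior, rates = model[label]
    return math.log(prior) + sum(
        math.log(rate if on else 1 - rate)
        for on, rate in zip(features(phrase), rates)
    )


def is_protein(model, phrase):
    return log_score(model, phrase, True) > log_score(model, phrase, False)


training = [
    ("D-amino acid oxidase", True), ("E2F", True), ("CD4", True),
    ("vimentin", True), ("protein kinase C mu", True),
    ("the acceptor", False), ("human tissues", False),
    ("the same position", False), ("heat stability", False),
    ("molecular size", False),
]
model = train(training)
print(is_protein(model, "D-aspartate oxidase"), is_protein(model, "genetic polymorphism"))
```

Only phrases that both match a context pattern and receive a positive classification would be extracted.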


Experimental Evaluation

• hypothesis: we get more accurate models by using weakly labeled data in addition to manually labeled data

• models use Autoslog-induced context patterns + naïve Bayes on morphological/syntax features of candidate names

• compare predictive accuracy resulting from
– fixed amount of hand-labeled data
– varying amounts of weakly labeled data + hand-labeled data
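Accuracy in this kind of comparison is usually reported as precision, TP / (TP + FP), against recall, TP / (TP + FN). The sketch below computes one such point, assuming each extraction carries a confidence score and is kept when the score reaches a threshold; the scores and the function name are hypothetical.

```python
def precision_recall(scored, threshold):
    """scored: list of (confidence, is_actual_positive) pairs."""
    tp = sum(1 for s, y in scored if s >= threshold and y)
    fp = sum(1 for s, y in scored if s >= threshold and not y)
    fn = sum(1 for s, y in scored if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical scored extractions.
scored = [(0.9, True), (0.8, True), (0.7, False), (0.4, True), (0.2, False)]
for threshold in (0.85, 0.5, 0.1):
    print(threshold, precision_recall(scored, threshold))
```

Sweeping the threshold from high to low traces out a precision-recall curve for one model.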


Extraction Accuracy: Yapex Data Set

[Figure: precision, TP / (TP + FP), plotted against recall, TP / (TP + FN), both on a 0 to 1 scale. Curves: NB model only; NB + Autoslog: 0 weak abstracts; NB + Autoslog: 90 weak abstracts; NB + Autoslog: 2,000 weak abstracts; NB + Autoslog: 25,100 weak abstracts]


Extraction Accuracy: Texas Data Set

[Figure: precision, TP / (TP + FP), plotted against recall, TP / (TP + FN), both on a 0 to 1 scale. Curves: NB model only; NB + Autoslog: 0 weak abstracts; NB + Autoslog: 1800 weak abstracts; NB + Autoslog: 2,000 weak abstracts; NB + Autoslog: 25,100 weak abstracts]


2. Representing Sentence Structure in Learned Models

• hidden Markov models (HMMs) have proven to be perhaps the best family of methods for learning IE models

• typically these HMMs have a “flat” structure, and are able to represent relatively little about grammatical structure

• how can we provide HMMs with more information about sentence structure?


Hidden Markov Models: Example

Pr(“... the Bed1 protein ...” | ... q1,q4,q2 ...)

[Figure: HMM state diagram with states start, q1, q2, q3, q4, q5, and end. Edges carry transition probabilities, and each state has its own emission distribution over words. Emission tables shown on the slide (their assignment to states is not recoverable from the transcript):]

the .001, protein .00005, ...
the .00001, protein .00002, ..., bed1 .001
the .007, protein .02, ...
the .0001, protein .03, ...
the .0001, protein .0003, ...
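As a small worked example, the probability of a word sequence given a fixed state path is the product of the per-state emission probabilities. The emission values below are taken from the slide's tables, but which table belongs to which state is an assumption made here for illustration.

```python
import math

# Hypothetical assignment of the slide's emission tables to states q1, q4, q2.
emissions = {
    "q1": {"the": 0.007, "protein": 0.02},
    "q4": {"the": 0.00001, "protein": 0.00002, "bed1": 0.001},
    "q2": {"the": 0.0001, "protein": 0.03},
}


def prob_words_given_path(words, path):
    """Pr(words | path): product of each state's emission probability for its word."""
    prob = 1.0
    for word, state in zip(words, path):
        prob *= emissions[state].get(word.lower(), 0.0)
    return prob


p = prob_words_given_path(["the", "Bed1", "protein"], ["q1", "q4", "q2"])
print(p)
```

Multiplying in the transition probabilities along the path would give the joint probability of the words and the path.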


Hidden Markov Models for Information Extraction

• there are efficient algorithms for doing the following with HMMs:
– determining the likelihood of a sentence given a model
– determining the most likely path through a model for a sentence
– setting the parameters of the model to maximize the likelihood of a set of sentences
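The first two tasks correspond to the standard forward and Viterbi algorithms (the third is usually handled with Baum-Welch, not shown). Below is a minimal sketch on a two-state toy model; the states, probabilities, and the omission of an explicit end state are all simplifying assumptions.

```python
def forward(words, states, start, trans, emit):
    """Likelihood of the word sequence, summed over all state paths."""
    alpha = {s: start[s] * emit[s].get(words[0], 0.0) for s in states}
    for word in words[1:]:
        alpha = {
            s: emit[s].get(word, 0.0) * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


def viterbi(words, states, start, trans, emit):
    """Most probable state path and its joint probability with the words."""
    best = {s: (start[s] * emit[s].get(words[0], 0.0), [s]) for s in states}
    for word in words[1:]:
        new_best = {}
        for s in states:
            # Best predecessor for state s at this position.
            prob, path = max(
                ((best[r][0] * trans[r][s], best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit[s].get(word, 0.0), path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# Hypothetical toy model.
states = ["other", "protein"]
start = {"other": 0.9, "protein": 0.1}
trans = {
    "other": {"other": 0.7, "protein": 0.3},
    "protein": {"other": 0.6, "protein": 0.4},
}
emit = {
    "other": {"the": 0.6, "bed1": 0.1, "protein": 0.3},
    "protein": {"the": 0.1, "bed1": 0.8, "protein": 0.1},
}
words = ["the", "bed1", "protein"]
likelihood = forward(words, states, start, trans, emit)
best_prob, best_path = viterbi(words, states, start, trans, emit)
print(likelihood, best_prob, best_path)
```

Both routines run in time linear in the sentence length and quadratic in the number of states.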


Representing Sentences

[Figure: shallow parse tree for the sentence “Our results suggest that Bed1 is found in the ER”. The sentence is split into two clauses; phrase-level nodes are noun phrase, verb phrase, and prep phrase; word-level tags include adjective, noun, verb, c_m, unk, cop, prep, and art, with Bed1 marked as a protein noun]

• we first process sentences by analyzing them with a shallow parser (Sundance, [Riloff et al., 98])


Hierarchical HMMs for IE (Part 1)

[Figure: phrase-level HMM with states START, NP-SEGMENT, PREP, PROTEIN NP-SEGMENT, LOCATION NP-SEGMENT, and END]

• [Ray & Craven, IJCAI 01; Skounakis et al, IJCAI 03]
• states have types, emit phrases
• some states have labels (PROTEIN, LOCATION)
• our models have 25 states at this level


Hierarchical HMMs for IE (Part 2)

[Figure: positive model with phrase-level states START, NP-SEGMENT, PREP, PROTEIN NP-SEGMENT, LOCATION NP-SEGMENT, and END; null model with phrase-level states START, NP-SEGMENT, PREP, and END]


Hierarchical HMMs for IE (Part 3)

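[Figure: phrase-level states START, PREP, NP-SEGMENT, PROTEIN NP-SEGMENT, LOCATION NP-SEGMENT, and END, each containing an embedded word-level HMM. Word-level states shown: START, ALL, END; START, BEFORE, BETWEEN, LOCATION, AFTER, END. Each word-level state has an emission distribution over words, e.g. Pr(the) = 0.0003, Pr(and) = 0.0002, …, Pr(cell) = 0.0001]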


Hierarchical HMMs

[Figure: phrase-level states START, PP-SEGMENT, VP-SEGMENT, PROTEIN NP-SEGMENT, LOCATION NP-SEGMENT, and END, with embedded word-level states START, ALL, END; START, BEFORE, BETWEEN, LOCATION, AFTER, END]

consider emitting: “. . . is found in the ER”

is found
in
the ER


Extraction with our HMMs

[Figure: two phrase-level models, one with states START, NP-SEGMENT, PP-SEGMENT, PROTEIN NP-SEGMENT, LOCATION NP-SEGMENT, and END, and one with states START, NP-SEGMENT, PP-SEGMENT, and END]

• extract a relation instance if
– sentence is more probable under positive model than under null model
– Viterbi (most probable) path goes through special extraction states
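The two conditions can be combined into a simple decision rule, sketched below. The inputs (log-likelihoods under a positive and a null model, and a Viterbi path given as labeled phrases) are assumed to come from trained models that are not shown; requiring both a PROTEIN and a LOCATION state on the path is an assumption about what "special extraction states" means for this relation.

```python
EXTRACTION_LABELS = {"PROTEIN", "LOCATION"}


def extract_relation(loglik_positive, loglik_null, viterbi_path):
    """viterbi_path: list of (state_label, phrase) pairs from the positive model."""
    # Condition 1: the sentence must be more probable under the positive model.
    if loglik_positive <= loglik_null:
        return None
    # Condition 2: the most probable path must visit the extraction states.
    fillers = {
        label: phrase for label, phrase in viterbi_path if label in EXTRACTION_LABELS
    }
    if set(fillers) != EXTRACTION_LABELS:
        return None
    return fillers["PROTEIN"], fillers["LOCATION"]


# Hypothetical model outputs for "Bed1 is found in the ER".
path = [
    ("PROTEIN", "Bed1"),
    ("VP-SEGMENT", "is found"),
    ("PP-SEGMENT", "in"),
    ("LOCATION", "the ER"),
]
accepted = extract_relation(-40.0, -45.0, path)
rejected = extract_relation(-45.0, -40.0, path)
print(accepted, rejected)
```

The phrases emitted by the labeled states become the arguments of the extracted tuple.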


Representing More Local Context

• we can have the word-level states represent more about the local context of each emission

• partition sentence into overlapping trigrams

“... the/ART Bed1/UNK protein/N is/COP located/V ...”

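[Figure: three overlapping trigrams over the example sentence, each written as $\langle w_{-1}, p_{-1} \rangle \; \langle w_0, p_0 \rangle \; \langle w_1, p_1 \rangle$, where $w$ is a word and $p$ is its part-of-speech tag]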
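A minimal sketch of the partitioning step follows. It produces one trigram of (word, part-of-speech) pairs centered on each token; padding the sentence boundaries with a `<pad>` symbol is an assumption, since the slides do not say how the edges are handled.

```python
def overlapping_trigrams(tagged, pad=("<pad>", "<pad>")):
    """tagged: list of (word, pos) pairs. Returns one trigram centered on each token."""
    padded = [pad] + list(tagged) + [pad]
    return [tuple(padded[i:i + 3]) for i in range(len(tagged))]


tagged = [
    ("the", "ART"), ("Bed1", "UNK"), ("protein", "N"),
    ("is", "COP"), ("located", "V"),
]
trigrams = overlapping_trigrams(tagged)
for trigram in trigrams:
    print(trigram)
```

Consecutive trigrams share two of their three tokens, which is what lets each emission carry local context.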


Representing More Local Context

• states emit trigrams with probability:

$t = \langle w_{-1}, w_0, w_1, p_{-1}, p_0, p_1 \rangle$

$\Pr(t) = \Pr(w_{-1}) \Pr(w_0) \Pr(w_1) \Pr(p_{-1}) \Pr(p_0) \Pr(p_1)$

• note the independence assumption above: we compensate for this naïve assumption by using a discriminative training method [Krogh ’94] to learn parameters
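Under the independence assumption, a state's probability of emitting a trigram factors into six terms: one word probability and one part-of-speech probability for each of the three positions. The sketch below computes that product; the per-position distributions, their values, and the small floor for unseen items are hypothetical.

```python
def trigram_emission_prob(trigram, word_dists, pos_dists, floor=1e-6):
    """Product of per-position word and part-of-speech probabilities."""
    prob = 1.0
    for k, (word, pos) in enumerate(trigram):
        # Position k = 0, 1, 2 corresponds to offsets -1, 0, +1 in the trigram.
        prob *= word_dists[k].get(word, floor) * pos_dists[k].get(pos, floor)
    return prob


# Hypothetical distributions for a single state.
word_dists = [{"the": 0.1}, {"Bed1": 0.01}, {"protein": 0.05}]
pos_dists = [{"ART": 0.3}, {"UNK": 0.2}, {"N": 0.4}]
trigram = (("the", "ART"), ("Bed1", "UNK"), ("protein", "N"))
prob = trigram_emission_prob(trigram, word_dists, pos_dists)
print(prob)
```

The factorization keeps the number of parameters per state small, at the cost of ignoring dependencies between a word and its tag.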


Experimental Evaluation

• hypothesis: we get more accurate models by using a richer representation of sentence structure in HMMs

• compare predictive accuracy of various types of representations (listed from most to least grammatical information)
– hierarchical w/context features
– hierarchical
– phrases
– tokens w/part of speech
– tokens

• 5-fold cross validation on 3 data sets


Weakly Labeled Data Sets for Learning to Extract Relations

• subcellular_localization(PROTEIN, LOCATION)
– YPD database
– 769 positive, 6193 negative sentences
– 939 tuples (402 distinct)

• disorder_association(GENE, DISEASE)
– OMIM database
– 829 positive, 11685 negative sentences
– 852 tuples (143 distinct)

• protein_protein_interaction(PROTEIN, PROTEIN)
– MIPS database
– 5446 positive, 41377 negative sentences
– 8088 tuples (819 distinct)


Extraction Accuracy (YPD)

[Figure: precision, TP / (TP + FP), plotted against recall, TP / (TP + FN), both on a 0 to 1 scale. Curves: Context HHMMs; HHMMs; Phrase HMMs; POS HMMs; Token HMMs]


Extraction Accuracy (MIPS)

[Figure: precision plotted against recall, both on a 0 to 1 scale. Curves: Context HHMMs; HHMMs; Phrase HMMs; POS HMMs; Token HMMs]


Extraction Accuracy (OMIM)

[Figure: precision plotted against recall, both on a 0 to 1 scale. Curves: Context HHMMs; HHMMs; Phrase HMMs; POS HMMs; Token HMMs]


3. Combining Evidence when Making Predictions

• in processing a large corpus, we are likely to see the same entities, relations in multiple places

• in making extractions, we should combine evidence across the different occurrences/contexts in which we see some entity/relation


Combining Evidence: Organizing Predictions into Bags

[Table with columns occurrence, predicted, actual; the predicted and actual marks are not recoverable from the transcript. Occurrences:]

CAT is a 64-kD protein…
CAT was established to be…
…were removed from cat brains.
…the cat activated the mouse...

let $n_b$ be the number of instances in bag $b$
$p_b$ be the number of positive predictions
$a_b$ be the number of actual positives
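Organizing predictions into bags can be sketched as a simple grouping step that records, for each bag, the number of instances and the number of positive predictions. Grouping by lowercased entity name and the per-occurrence predictions below are assumptions for illustration; the number of actual positives is unknown at prediction time and is therefore not computed.

```python
from collections import defaultdict


def organize_into_bags(predictions):
    """predictions: list of (entity, context, predicted_positive) triples."""
    bags = defaultdict(list)
    for entity, context, predicted in predictions:
        bags[entity.lower()].append((context, predicted))
    return {
        key: {"n_b": len(items), "p_b": sum(1 for _, pred in items if pred)}
        for key, items in bags.items()
    }


# Occurrences from the slide; the predicted labels are hypothetical.
predictions = [
    ("CAT", "CAT is a 64-kD protein…", True),
    ("CAT", "CAT was established to be…", True),
    ("cat", "…were removed from cat brains.", False),
    ("cat", "…the cat activated the mouse...", False),
]
stats = organize_into_bags(predictions)
print(stats)
```

Each bag then yields the pair of counts that the evidence-combination step conditions on.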


Combining Evidence when Making Predictions

• given a bag of predictions, estimate the probability that the bag contains at least one actual positive example:

$$\Pr(a_b > 0 \mid n_b, p_b) = \frac{\sum_{j=1}^{n_b} \Pr(p_b \mid a_b = j, n_b)\, \Pr(a_b = j \mid n_b)}{\sum_{i=0}^{n_b} \Pr(p_b \mid a_b = i, n_b)\, \Pr(a_b = i \mid n_b)}$$


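Combining Evidence: Estimating Relevant Probabilities

$$\Pr(a_b > 0 \mid n_b, p_b) = \frac{\sum_{j=1}^{n_b} \Pr(p_b \mid a_b = j, n_b)\, \Pr(a_b = j \mid n_b)}{\sum_{i=0}^{n_b} \Pr(p_b \mid a_b = i, n_b)\, \Pr(a_b = i \mid n_b)}$$

• $\Pr(p_b \mid a_b, n_b)$: can model with two binomial distributions based on estimated TP-rate, FP-rate of model
• $\Pr(a_b \mid n_b)$: can do something simple here (e.g. assume uniform priors) or can estimate this from data w/ a few assumptions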
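One way to read these two suggestions is sketched below. It assumes that, among a_b actual positives, the number predicted positive is binomial with the model's TP-rate, that among the remaining n_b - a_b instances the number predicted positive is binomial with the FP-rate, and that the prior over a_b is uniform on 0..n_b. The rates and function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, rate):
    """Binomial probability of k successes in n trials."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * rate**k * (1 - rate) ** (n - k)


def prob_predictions(p, a, n, tpr, fpr):
    """Pr(p_b = p | a_b = a, n_b = n): true positives among a, false positives among n - a."""
    return sum(
        binom_pmf(k, a, tpr) * binom_pmf(p - k, n - a, fpr) for k in range(p + 1)
    )


def prob_bag_positive(n, p, tpr, fpr):
    """Pr(a_b > 0 | n_b = n, p_b = p) under a uniform prior on a_b."""
    prior = 1.0 / (n + 1)
    terms = [prob_predictions(p, a, n, tpr, fpr) * prior for a in range(n + 1)]
    return sum(terms[1:]) / sum(terms)


# Hypothetical rates: TP-rate 0.8, FP-rate 0.1, bag of 4 instances.
high = prob_bag_positive(4, 3, 0.8, 0.1)
low = prob_bag_positive(4, 0, 0.8, 0.1)
print(high, low)
```

A bag with three positive predictions out of four is judged almost certainly positive, while a bag with none keeps only the residual probability that the model missed every true instance.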


Evidence Combination: Protein-Protein Interactions

[Figure: precision plotted against recall, both on a 0 to 1 scale. Legend as transcribed: BECIPSoft-Count; Noisy-OR; WM; Soft-OR; NC]


Evidence Combination: Protein Names

[Figure: precision plotted against recall, both on a 0 to 1 scale. Legend as transcribed: BECIPSoft-Count; Noisy-OR; WM; Soft-OR; NC]


Conclusions

• machine learning methods provide a means for learning/refining models for information extraction

• learning is inexpensive when unlabeled/weakly labeled sources can be exploited
– learning context patterns for protein names
– learning HMMs for relation extraction

• we can learn more accurate models by giving HMMs more information about syntactic structure of sentences
– hierarchical HMMs

• we can improve the precision of our predictions by carefully combining evidence across extractions


Acknowledgments

my graduate students
Soumya Ray
Burr Settles
Marios Skounakis

NIH/NLM grant 1R01 LM07050-01

NSF CAREER grant IIS-0093016