TRANSCRIPT
Using Weakly Labeled Data to Learn Models for Extracting
Information from Biomedical Text
Mark Craven
Department of Biostatistics & Medical Informatics
Department of Computer Sciences
University of Wisconsin
www.biostat.wisc.edu/~craven
The Information Extraction Task
Analysis of Yeast PRP20 Mutations and Functional Complementation by the Human Homologue RCC1, a Protein Involved in the Control of Chromosome Condensation
Fleischmann M, Clark M, Forrester W, Wickens M, Nishimoto T, Aebi M
Mutations in the PRP20 gene of yeast show a pleiotropic phenotype, in which both mRNA metabolism and nuclear structure are affected. SRM1 mutants, defective in the same gene, influence the signal transduction pathway for the pheromone response . . . By immunofluorescence microscopy the PRP20 protein was localized in the nucleus. Expression of the RCC1 protein can complement the temperature-sensitive phenotype of PRP20 mutants, demonstrating the functional similarity of the yeast and mammalian proteins.
protein(PRP20)
subcellular-localization(PRP20, nucleus)
Motivation
• assisting in the construction and updating of databases
• providing structured summaries for queries
What is known about protein X (subcellular & tissue localization, associations with diseases, interactions with drugs, …)?
• assisting scientific discovery by detecting previously unknown relationships, annotating experimental data
Three Themes in Our IE Research
1. Using “weakly” labeled training data
2. Representing sentence structure in learned models
3. Combining evidence when making predictions
1. Using “Weakly” Labeled Data
• why use machine learning methods in building information-extraction systems?
– hand-coding IE systems is expensive, time-consuming
– there is a lot of data that can be leveraged
• where do we get a training set?
– by having someone hand-label data (expensive)
– by coupling tuples in an existing database with relevant documents (cheap)
“Weakly” Labeled Training Data
• to get positive examples, match DB tuples to passages of text referencing constants in tuples
[Figure: tuples (P1, L1), (P2, L2), (P3, L3) from the YPD database matched to MEDLINE abstracts containing passages such as “…P1…L1…”, “…P2…L2…”, “…L3…P3…”]
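The matching step can be sketched as follows. This is a toy version of the idea, not the actual YPD/MEDLINE pipeline: the tuples and sentences are illustrative, and the matching rule is simple case-insensitive co-occurrence of both constants in one sentence.

```python
import re

# Hypothetical (protein, location) tuples, in the style of a YPD export.
tuples = [("PRP20", "nucleus"), ("VAC8p", "vacuole")]

sentences = [
    "The PRP20 protein was localized in the nucleus.",
    "VAC8p is required to target aminopeptidase I to the vacuole.",
    "PRP20 mutants show a pleiotropic phenotype.",
]

def weakly_label(tuples, sentences):
    """Mark a sentence as a weakly labeled positive for a tuple if both
    of the tuple's constants co-occur in it (case-insensitive)."""
    positives = []
    for protein, location in tuples:
        for s in sentences:
            if (re.search(re.escape(protein), s, re.I)
                    and re.search(re.escape(location), s, re.I)):
                positives.append(((protein, location), s))
    return positives

labeled = weakly_label(tuples, sentences)
```

Note that the third sentence mentions PRP20 but not a location, so it is not labeled; this simplicity is exactly why the labeling is "weak".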
Weakly Labeled Training Data
In addition to its role in early vacuole inheritance, VAC8p is required to target aminopeptidase I from the cytoplasm to the vacuole.
In analogy, VAC8p may link the vacuole to actin during vacuole partitioning.
VAC8p is a 64-kD protein found on the vacuole membrane, a site consistent with its role in vacuole inheritance.
• the labeling is weak in that many sentences with co-occurrences wouldn’t be considered positive examples if we were hand-labeling them
• consider the sentences associated with the relation subcellular-localization(VAC8p, vacuole) after weak labeling
Learning Context Patterns for Recognizing Protein Names
…gene encoding <p>gamma-glutamyl kinase</p> was…
…recognized genes encoding <p>vimentin</p>, heat…
…found that <p>E2F</p> binds specifically…
…<p>IleRS</p> binds to the acceptor…
…of <p>CPB II</p> binds 1 mol of…
…purified C/<p>EBP</p> binds at the same position…
…which interacts with <p>CD4</p>: both…
…14-3-3tau interacts with <p>protein kinase C mu</p>, a subtype…
selections from the training corpus
encoding [X] 2/4
[X] binds 4/5
interacts with [X] 2/6
• We use AutoSlog [Riloff ’96] to find “triggers” that commonly occur before and after tagged proteins in a training corpus
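A toy version of the trigger-counting idea; AutoSlog itself uses syntactic heuristics, whereas this sketch just tallies the single word immediately before and after each tagged protein in a small made-up corpus.

```python
import re
from collections import Counter

# Hypothetical tagged training corpus; <p>...</p> marks protein names.
corpus = [
    "gene encoding <p>gamma-glutamyl kinase</p> was",
    "found that <p>E2F</p> binds specifically",
    "<p>IleRS</p> binds to the acceptor",
    "which interacts with <p>CD4</p>: both",
]

def extract_triggers(corpus):
    """Count the one-word contexts occurring before/after tagged proteins."""
    before, after = Counter(), Counter()
    for text in corpus:
        for m in re.finditer(r"(\S+)?\s*<p>.*?</p>\s*(\S+)?", text):
            if m.group(1):
                before[m.group(1) + " [X]"] += 1
            if m.group(2):
                after["[X] " + m.group(2)] += 1
    return before, after

before, after = extract_triggers(corpus)
```

Counts like these, normalized against how often each context appears at all, give the pattern-precision fractions shown above (e.g. encoding [X] 2/4).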
“Weak” Labeling Example
Two distinct forms of oxidases catalysing the oxidative deamidation of D-alpha-amino acids have been identified in human tissues: <p>D-amino acid oxidase</p> and <p>D-aspartate oxidase</p>. The enzymes differ in their electrophoretic properties, tissue distribution, binding with flavine adenine dinucleotide, heat stability, molecular size and possibly in subunit structure. Neither enzyme exhibits genetic polymorphism in European populations, but a rare electrophoretic variant phenotype (<p>DASOX</p> 2-1) was identified which suggests that the <p>DASOX</p> locus is autosomal and independent of the <p>DAMOX</p> locus.
...
D-AKAP-2
D-amino acid oxidase
D-aspartate oxidase
D-dopachrome tautomerase
…
DAG kinase zeta
DAMOX
DASOX
DAT
DB83 protein
…
PubMed abstract
SwissProt dictionary
Protein Name Extraction Approach
Two distinct forms of oxidases catalysing the oxidative deamidation of D-alpha-amino acids have been identified in human tissues: D-amino acid oxidase and
encoding [X]
[X] binds
interacts with [X]
select noun phrases that match AutoSlog patterns
classify noun phrases using a naïve Bayes model
extract positive classifications
D-amino acid oxidase
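A minimal sketch of the naïve Bayes classification step. The surface features here (digits, hyphens, casing) are made up for illustration; the actual models use richer morphological and syntactic features of the candidate names.

```python
import math
from collections import defaultdict

def features(phrase):
    # Illustrative boolean surface features of a candidate noun phrase.
    return {
        "has_digit": any(c.isdigit() for c in phrase),
        "has_hyphen": "-" in phrase,
        "all_lower": phrase.islower(),
    }

def train(examples):
    """examples: (phrase, is_protein) pairs; returns smoothed count model."""
    counts = {True: defaultdict(lambda: 1), False: defaultdict(lambda: 1)}
    totals = {True: 2, False: 2}  # Laplace smoothing
    priors = {True: 0, False: 0}
    for phrase, label in examples:
        priors[label] += 1
        totals[label] += 1
        for f, v in features(phrase).items():
            if v:
                counts[label][f] += 1
    return counts, totals, priors

def classify(phrase, model):
    """Return True iff the protein class has the higher posterior score."""
    counts, totals, priors = model
    scores = {}
    for label in (True, False):
        s = math.log(priors[label] / sum(priors.values()))
        for f, v in features(phrase).items():
            p = counts[label][f] / totals[label]
            s += math.log(p if v else 1 - p)
        scores[label] = s
    return scores[True] > scores[False]

# Toy training set: two protein names, two ordinary noun phrases.
examples = [("D-amino acid oxidase", True), ("Bed1", True),
            ("human tissues", False), ("the enzymes", False)]
model = train(examples)
```

Only noun phrases that both match an AutoSlog pattern and are classified positive by a model like this one are extracted.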
Experimental Evaluation
• hypothesis: we get more accurate models by using weakly labeled data in addition to manually labeled data
• models use Autoslog-induced context patterns + naïve Bayes on morphological/syntax features of candidate names
• compare predictive accuracy resulting from
– fixed amount of hand-labeled data
– varying amounts of weakly labeled data + hand-labeled data
Extraction Accuracy: Yapex Data Set
[Precision-recall curves; precision = TP / (TP + FP), recall = TP / (TP + FN); curves shown for:]
NB model only
NB + AutoSlog: 0 weak abstracts
NB + AutoSlog: 90 weak abstracts
NB + AutoSlog: 2,000 weak abstracts
NB + AutoSlog: 25,100 weak abstracts
Extraction Accuracy: Texas Data Set
[Precision-recall curves for:]
NB model only
NB + AutoSlog: 0 weak abstracts
NB + AutoSlog: 1,800 weak abstracts
NB + AutoSlog: 2,000 weak abstracts
NB + AutoSlog: 25,100 weak abstracts
2. Representing Sentence Structure in Learned Models
• hidden Markov models (HMMs) have proven to be perhaps the best family of methods for learning IE models
• typically these HMMs have a “flat” structure, and are able to represent relatively little about grammatical structure
• how can we provide HMMs with more information about sentence structure?
Hidden Markov Models: Example
Pr(“... the Bed1 protein ...” | ... q1,q4,q2 ...)
[Figure: an HMM with states q1–q5 plus start and end states; edges carry transition probabilities (e.g. .4, .3, .2, .1) and each state has its own emission distribution over words, e.g. one state with the .007, protein .02, … and another with bed1 .001; the path q1, q4, q2 generates “... the Bed1 protein ...”]
Hidden Markov Models for Information Extraction
• there are efficient algorithms for doing the following with HMMs:
– determining the likelihood of a sentence given a model
– determining the most likely path through a model for a sentence
– setting the parameters of the model to maximize the likelihood of a set of sentences
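The second of these operations is the Viterbi algorithm. A minimal log-space sketch on a toy two-state model (a "background" state and a "protein" state; all probabilities here are invented for illustration):

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for an observation sequence (log-space DP)."""
    # delta[t][s] = best log-score of any path ending in state s at time t
    delta = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], 1e-12))
              for s in states}]
    back = []  # back[t-1][s] = best predecessor of s at time t
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: delta[t - 1][p] + math.log(trans_p[p][s]))
            delta[t][s] = (delta[t - 1][prev] + math.log(trans_p[prev][s])
                           + math.log(emit_p[s].get(obs[t], 1e-12)))
            back[t - 1][s] = prev
    # trace back from the best final state
    state = max(states, key=lambda s: delta[-1][s])
    path = [state]
    for bp in reversed(back):
        state = bp[state]
        path.append(state)
    return list(reversed(path))

states = ["BG", "PROT"]
start_p = {"BG": 0.9, "PROT": 0.1}
trans_p = {"BG": {"BG": 0.8, "PROT": 0.2},
           "PROT": {"BG": 0.7, "PROT": 0.3}}
emit_p = {"BG": {"the": 0.3, "protein": 0.05},
          "PROT": {"bed1": 0.4, "the": 0.001}}
path = viterbi(["the", "bed1", "protein"], states, start_p, trans_p, emit_p)
```

On this toy model the best path routes "bed1" through the protein state, which is exactly how extraction states identify fields in a sentence.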
Representing Sentences
[Figure: shallow parse of “Our results suggest that Bed1 is found in the ER” — the sentence decomposed into clauses, noun phrases, verb phrases, and a prepositional phrase, with word-level tags (art, adjective, noun, verb, cop, unk) and Bed1 marked as a protein noun]
• we first process sentences by analyzing them with a shallow parser (Sundance, [Riloff et al., 98])
Hierarchical HMMs for IE(Part 1)
[Figure: phrase-level model with START and END states, unlabeled NP-SEGMENT and PREP states, and labeled PROTEIN NP-SEGMENT and LOCATION NP-SEGMENT states]
• [Ray & Craven, IJCAI 01; Skounakis et al., IJCAI 03]
• states have types, emit phrases
• some states have labels (PROTEIN, LOCATION)
• our models have 25 states at this level
Hierarchical HMMs for IE (Part 2)
[Figure: the positive model (START, END, NP-SEGMENT, PREP, and labeled PROTEIN NP-SEGMENT and LOCATION NP-SEGMENT states) alongside a null model containing only START, END, NP-SEGMENT, and PREP states]
positive model
null model
Hierarchical HMMs for IE (Part 3)
[Figure: phrase-level model as above; each phrase-level state (e.g. the LOCATION NP-SEGMENT) expands into a word-level HMM with its own START, BEFORE, BETWEEN, AFTER, and END states, each word-level state emitting words with probabilities such as:]
Pr(the) = 0.0003, Pr(and) = 0.0002, …, Pr(cell) = 0.0001
Hierarchical HMMs
[Figure: phrase-level model with VP-SEGMENT, PP-SEGMENT, PROTEIN NP-SEGMENT, and LOCATION NP-SEGMENT states, and nested word-level states as above]
consider emitting “. . . is found in the ER”:
is found
in
the ER
Extraction with our HMMs
[Figure: positive model (START, END, NP-SEGMENT, PP-SEGMENT, PROTEIN NP-SEGMENT, LOCATION NP-SEGMENT) and null model (START, END, NP-SEGMENT, PP-SEGMENT only)]
• extract a relation instance if
– sentence is more probable under positive model
– Viterbi (most probable) path goes through special extraction states
Representing More Local Context
• we can have the word-level states represent more about the local context of each emission
• partition sentence into overlapping trigrams
“... the/ART Bed1/UNK protein/N is/COP located/V ...”
⟨w-1 p-1, w0 p0, w+1 p+1⟩ — one overlapping window per token position
Representing More Local Context
• states emit trigrams with probability:

t = ⟨w-1, w0, w+1, p-1, p0, p+1⟩
Pr(t) = Pr(w-1) Pr(w0) Pr(w+1) Pr(p-1) Pr(p0) Pr(p+1)

• note the independence assumption above: we compensate for this naïve assumption by using a discriminative training method [Krogh ’94] to learn parameters
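A sketch of the trigram emission under the independence assumption: the probability of emitting a trigram is the product of independent per-word and per-tag factors. The distributions below are toy values, not learned parameters.

```python
# Toy per-word and per-POS-tag emission distributions for one state.
word_p = {"the": 0.05, "bed1": 0.01, "protein": 0.02}
tag_p = {"ART": 0.2, "UNK": 0.1, "N": 0.3}

def emit_trigram(words, tags, word_p, tag_p, smooth=1e-6):
    """Pr(t) as a product of independent word and tag probabilities,
    with a small smoothing value for unseen items."""
    pr = 1.0
    for w in words:
        pr *= word_p.get(w, smooth)
    for p in tags:
        pr *= tag_p.get(p, smooth)
    return pr

pr = emit_trigram(("the", "bed1", "protein"), ("ART", "UNK", "N"),
                  word_p, tag_p)
```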
Experimental Evaluation
• hypothesis: we get more accurate models by using a richer representation of sentence structure in HMMs
• compare predictive accuracy of various types of representations
– hierarchical w/context features
– hierarchical
– phrases
– tokens w/part of speech
– tokens
• 5-fold cross validation on 3 data sets
more grammatical information
Weakly Labeled Data Sets for Learning to Extract Relations
• subcellular_localization(PROTEIN, LOCATION)
– YPD database
– 769 positive, 6193 negative sentences
– 939 tuples (402 distinct)
• disorder_association(GENE, DISEASE)
– OMIM database
– 829 positive, 11685 negative sentences
– 852 tuples (143 distinct)
• protein_protein_interaction(PROTEIN, PROTEIN)
– MIPS database
– 5446 positive, 41377 negative sentences
– 8088 tuples (819 distinct)
Extraction Accuracy (YPD)
[Precision-recall curves for:]
Context HHMMs
HHMMs
Phrase HMMs
POS HMMs
Token HMMs
Extraction Accuracy (MIPS)
[Precision-recall curves for:]
Context HHMMs
HHMMs
Phrase HMMs
POS HMMs
Token HMMs
Extraction Accuracy (OMIM)
[Precision-recall curves for:]
Context HHMMs
HHMMs
Phrase HMMs
POS HMMs
Token HMMs
3. Combining Evidence when Making Predictions
• in processing a large corpus, we are likely to see the same entities and relations in multiple places
• in making extractions, we should combine evidence across the different occurrences/contexts in which we see an entity/relation
Combining Evidence:Organizing Predictions into Bags
CAT is a 64-kD protein…
CAT was established to be…
…were removed from cat brains.
…the cat activated the mouse...
[Table columns: occurrence, predicted, actual]
let n_b be the number of instances in bag b
let p_b be the number of positive predictions
let a_b be the number of actual positives
Combining Evidence when Making Predictions
• given a bag of predictions, estimate the probability that the bag contains at least one actual positive example:

Pr(a_b > 0 | n_b, p_b) =
  [ Σ_{j=1}^{n_b} Pr(p_b | a_b = j, n_b) Pr(a_b = j | n_b) ] / [ Σ_{i=0}^{n_b} Pr(p_b | a_b = i, n_b) Pr(a_b = i | n_b) ]
Combining Evidence:Estimating Relevant Probabilities
Pr(a_b > 0 | n_b, p_b) =
  [ Σ_{j=1}^{n_b} Pr(p_b | a_b = j, n_b) Pr(a_b = j | n_b) ] / [ Σ_{i=0}^{n_b} Pr(p_b | a_b = i, n_b) Pr(a_b = i | n_b) ]

Pr(p_b | a_b, n_b): can model with two binomial distributions based on the estimated TP-rate and FP-rate of the model
Pr(a_b | n_b): can do something simple here (e.g. assume uniform priors), or can estimate this from data with a few assumptions
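A sketch of this estimate, assuming binomial models for Pr(p_b | a_b, n_b) built from the model's TP- and FP-rates and a uniform prior for Pr(a_b | n_b); the specific rates in the usage line are invented.

```python
from math import comb

def pr_positive_bag(n, p, tp_rate, fp_rate):
    """Pr(a > 0 | n, p): probability that a bag of n instances with p
    positive predictions contains at least one actual positive."""
    def pr_p_given_a(a):
        # p predictions split into k true positives (of a actual positives)
        # and p - k false positives (of n - a actual negatives)
        total = 0.0
        for k in range(0, min(a, p) + 1):
            if p - k > n - a:
                continue
            total += (comb(a, k) * tp_rate**k * (1 - tp_rate)**(a - k)
                      * comb(n - a, p - k) * fp_rate**(p - k)
                      * (1 - fp_rate)**(n - a - (p - k)))
        return total

    prior = 1.0 / (n + 1)  # uniform Pr(a | n) over a = 0..n
    num = sum(pr_p_given_a(j) * prior for j in range(1, n + 1))
    den = sum(pr_p_given_a(i) * prior for i in range(0, n + 1))
    return num / den

v1 = pr_positive_bag(4, 3, tp_rate=0.8, fp_rate=0.1)  # 3 of 4 predicted positive
v0 = pr_positive_bag(4, 0, tp_rate=0.8, fp_rate=0.1)  # none predicted positive
```

As expected, a bag with many positive predictions gets a much higher probability of containing a true positive than one with none.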
Evidence Combination: Protein-Protein Interactions
[Precision-recall curves for:]
BECIP
Soft-Count
Noisy-OR
WM
Soft-OR
NC
Evidence Combination: Protein Names
[Precision-recall curves for:]
BECIP
Soft-Count
Noisy-OR
WM
Soft-OR
NC
Conclusions
• machine learning methods provide a means for learning/refining models for information extraction
• learning is inexpensive when unlabeled/weakly labeled sources can be exploited– learning context patterns for protein names– learning HMMs for relation extraction
• we can learn more accurate models by giving HMMs more information about syntactic structure of sentences– hierarchical HMMs
• we can improve the precision of our predictions by carefully combining evidence across extractions
Acknowledgments
my graduate students
Soumya Ray
Burr Settles
Marios Skounakis
NIH/NLM grant 1R01 LM07050-01
NSF CAREER grant IIS-0093016