TRANSCRIPT
Using Weakly Labeled Data to Learn Models for Extracting
Information from Biomedical Text
Mark Craven
Department of Biostatistics & Medical Informatics
Department of Computer Sciences
University of Wisconsin
www.biostat.wisc.edu/~craven
The Information Extraction Task
Analysis of Yeast PRP20 Mutations and Functional Complementation by the Human Homologue RCC1, a Protein Involved in the Control of Chromosome Condensation
Fleischmann M, Clark M, Forrester W, Wickens M, Nishimoto T, Aebi M
Mutations in the PRP20 gene of yeast show a pleiotropic phenotype, in which both mRNA metabolism and nuclear structure are affected. SRM1 mutants, defective in the same gene, influence the signal transduction pathway for the pheromone response . . . By immunofluorescence microscopy the PRP20 protein was localized in the nucleus. Expression of the RCC1 protein can complement the temperature-sensitive phenotype of PRP20 mutants, demonstrating the functional similarity of the yeast and mammalian proteins.
protein(PRP20)
subcellular-localization(PRP20, nucleus)
Motivation
• assisting in the construction and updating of databases
• providing structured summaries for queries
What is known about protein X (subcellular & tissue localization, associations with diseases, interactions with drugs, …)?
• assisting scientific discovery by detecting previously unknown relationships, annotating experimental data
Three Themes in Our IE Research
1. Using “weakly” labeled training data
2. Representing sentence structure in learned models
3. Combining evidence when making predictions
1. Using “Weakly” Labeled Data
• why use machine learning methods in building information-extraction systems?
– hand-coding IE systems is expensive, time-consuming
– there is a lot of data that can be leveraged
• where do we get a training set?
– by having someone hand-label data (expensive)
– by coupling tuples in an existing database with relevant documents (cheap)
“Weakly” Labeled Training Data
• to get positive examples, match DB tuples to passages of text referencing constants in tuples
[Figure: tuples (P1, L1), (P2, L2), (P3, L3) from the YPD database matched to MEDLINE abstracts containing passages such as “…P1…L1…”, “…P2…L2…”, “…L3…P3…”]
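The matching step can be sketched as follows. This is a toy version of the idea, not the actual YPD/MEDLINE pipeline: the tuples and sentences are illustrative, and the matching rule is simple case-insensitive co-occurrence of both constants in one sentence.

```python
import re

# Hypothetical (protein, location) tuples, in the style of a YPD export.
tuples = [("PRP20", "nucleus"), ("VAC8p", "vacuole")]

sentences = [
    "The PRP20 protein was localized in the nucleus.",
    "VAC8p is required to target aminopeptidase I to the vacuole.",
    "PRP20 mutants show a pleiotropic phenotype.",
]

def weakly_label(tuples, sentences):
    """Mark a sentence as a weakly labeled positive for a tuple if both
    of the tuple's constants co-occur in it (case-insensitive)."""
    positives = []
    for protein, location in tuples:
        for s in sentences:
            if (re.search(re.escape(protein), s, re.I)
                    and re.search(re.escape(location), s, re.I)):
                positives.append(((protein, location), s))
    return positives

labeled = weakly_label(tuples, sentences)
```

Note that the third sentence mentions PRP20 but not a location, so it is not labeled; this simplicity is exactly why the labeling is "weak".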
Weakly Labeled Training Data
In addition to its role in early vacuole inheritance, VAC8p is required to target aminopeptidase I from the cytoplasm to the vacuole.
In analogy, VAC8p may link the vacuole to actin during vacuole partitioning.
VAC8p is a 64-kD protein found on the vacuole membrane, a site consistent with its role in vacuole inheritance.
• the labeling is weak in that many sentences with co-occurrences wouldn’t be considered positive examples if we were hand-labeling them
• consider the sentences associated with the relation subcellular-localization(VAC8p, vacuole) after weak labeling
Learning Context Patterns for Recognizing Protein Names
…gene encoding <p>gamma-glutamyl kinase</p> was…
…recognized genes encoding <p>vimentin</p>, heat…
…found that <p>E2F</p> binds specifically…
…<p>IleRS</p> binds to the acceptor…
…of <p>CPB II</p> binds 1 mol of…
…purified C/<p>EBP</p> binds at the same position…
…which interacts with <p>CD4</p>: both…
…14-3-3tau interacts with <p>protein kinase C mu</p>, a subtype…
selections from the training corpus
encoding [X] 2/4
[X] binds 4/5
interacts with [X] 2/6
• We use AutoSlog [Riloff ’96] to find “triggers” that commonly occur before and after tagged proteins in a training corpus
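A toy version of the trigger-counting idea; AutoSlog itself uses syntactic heuristics, whereas this sketch just tallies the single word immediately before and after each tagged protein in a small made-up corpus.

```python
import re
from collections import Counter

# Hypothetical tagged training corpus; <p>...</p> marks protein names.
corpus = [
    "gene encoding <p>gamma-glutamyl kinase</p> was",
    "found that <p>E2F</p> binds specifically",
    "<p>IleRS</p> binds to the acceptor",
    "which interacts with <p>CD4</p>: both",
]

def extract_triggers(corpus):
    """Count the one-word contexts occurring before/after tagged proteins."""
    before, after = Counter(), Counter()
    for text in corpus:
        for m in re.finditer(r"(\S+)?\s*<p>.*?</p>\s*(\S+)?", text):
            if m.group(1):
                before[m.group(1) + " [X]"] += 1
            if m.group(2):
                after["[X] " + m.group(2)] += 1
    return before, after

before, after = extract_triggers(corpus)
```

Counts like these, normalized against how often each context appears at all, give the pattern-precision fractions shown above (e.g. encoding [X] 2/4).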
“Weak” Labeling Example
Two distinct forms of oxidases catalysing the oxidative deamidation of D-alpha-amino acids have been identified in human tissues: <p>D-amino acid oxidase</p> and <p>D-aspartate oxidase</p>. The enzymes differ in their electrophoretic properties, tissue distribution, binding with flavine adenine dinucleotide, heat stability, molecular size and possibly in subunit structure. Neither enzyme exhibits genetic polymorphism in European populations, but a rare electrophoretic variant phenotype (<p>DASOX</p> 2-1) was identified which suggests that the <p>DASOX</p> locus is autosomal and independent of the <p>DAMOX</p> locus.
...
D-AKAP-2
D-amino acid oxidase
D-aspartate oxidase
D-dopachrome tautomerase
…
DAG kinase zeta
DAMOX
DASOX
DAT
DB83 protein
…
PubMed abstract
SwissProt dictionary
Protein Name Extraction Approach
Two distinct forms of oxidases catalysing the oxidative deamidation of D-alpha-amino acids have been identified in human tissues: D-amino acid oxidase and
encoding [X]
[X] binds
interacts with [X]
select noun phrases that match AutoSlog patterns
classify noun phrases using a naïve Bayes model
extract positive classifications
D-amino acid oxidase
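A minimal sketch of the naïve Bayes classification step. The surface features here (digits, hyphens, casing) are made up for illustration; the actual models use richer morphological and syntactic features of the candidate names.

```python
import math
from collections import defaultdict

def features(phrase):
    # Illustrative boolean surface features of a candidate noun phrase.
    return {
        "has_digit": any(c.isdigit() for c in phrase),
        "has_hyphen": "-" in phrase,
        "all_lower": phrase.islower(),
    }

def train(examples):
    """examples: (phrase, is_protein) pairs; returns smoothed count model."""
    counts = {True: defaultdict(lambda: 1), False: defaultdict(lambda: 1)}
    totals = {True: 2, False: 2}  # Laplace smoothing
    priors = {True: 0, False: 0}
    for phrase, label in examples:
        priors[label] += 1
        totals[label] += 1
        for f, v in features(phrase).items():
            if v:
                counts[label][f] += 1
    return counts, totals, priors

def classify(phrase, model):
    """Return True iff the protein class has the higher posterior score."""
    counts, totals, priors = model
    scores = {}
    for label in (True, False):
        s = math.log(priors[label] / sum(priors.values()))
        for f, v in features(phrase).items():
            p = counts[label][f] / totals[label]
            s += math.log(p if v else 1 - p)
        scores[label] = s
    return scores[True] > scores[False]

# Toy training set: two protein names, two ordinary noun phrases.
examples = [("D-amino acid oxidase", True), ("Bed1", True),
            ("human tissues", False), ("the enzymes", False)]
model = train(examples)
```

Only noun phrases that both match an AutoSlog pattern and are classified positive by a model like this one are extracted.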
Experimental Evaluation
• hypothesis: we get more accurate models by using weakly labeled data in addition to manually labeled data
• models use Autoslog-induced context patterns + naïve Bayes on morphological/syntax features of candidate names
• compare predictive accuracy resulting from
– fixed amount of hand-labeled data
– varying amounts of weakly labeled data + hand-labeled data
Extraction Accuracy: Yapex Data Set
[Precision-recall curves; precision = TP / (TP + FP), recall = TP / (TP + FN); curves shown for:]
NB model only
NB + AutoSlog: 0 weak abstracts
NB + AutoSlog: 90 weak abstracts
NB + AutoSlog: 2,000 weak abstracts
NB + AutoSlog: 25,100 weak abstracts
Extraction Accuracy: Texas Data Set
[Precision-recall curves for:]
NB model only
NB + AutoSlog: 0 weak abstracts
NB + AutoSlog: 1,800 weak abstracts
NB + AutoSlog: 2,000 weak abstracts
NB + AutoSlog: 25,100 weak abstracts
2. Representing Sentence Structure in Learned Models
• hidden Markov models (HMMs) have proven to be perhaps the best family of methods for learning IE models
• typically these HMMs have a “flat” structure, and are able to represent relatively little about grammatical structure
• how can we provide HMMs with more information about sentence structure?
Hidden Markov Models: Example
Pr(“... the Bed1 protein ...” | ... q1,q4,q2 ...)
[Figure: an HMM with states q1–q5 plus start and end states; edges carry transition probabilities (e.g. .4, .3, .2, .1) and each state has its own emission distribution over words, e.g. one state with the .007, protein .02, … and another with bed1 .001; the path q1, q4, q2 generates “... the Bed1 protein ...”]
Hidden Markov Models for Information Extraction
• there are efficient algorithms for doing the following with HMMs:
– determining the likelihood of a sentence given a model
– determining the most likely path through a model for a sentence
– setting the parameters of the model to maximize the likelihood of a set of sentences
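The second of these operations is the Viterbi algorithm. A minimal log-space sketch on a toy two-state model (a "background" state and a "protein" state; all probabilities here are invented for illustration):

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for an observation sequence (log-space DP)."""
    # delta[t][s] = best log-score of any path ending in state s at time t
    delta = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], 1e-12))
              for s in states}]
    back = []  # back[t-1][s] = best predecessor of s at time t
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: delta[t - 1][p] + math.log(trans_p[p][s]))
            delta[t][s] = (delta[t - 1][prev] + math.log(trans_p[prev][s])
                           + math.log(emit_p[s].get(obs[t], 1e-12)))
            back[t - 1][s] = prev
    # trace back from the best final state
    state = max(states, key=lambda s: delta[-1][s])
    path = [state]
    for bp in reversed(back):
        state = bp[state]
        path.append(state)
    return list(reversed(path))

states = ["BG", "PROT"]
start_p = {"BG": 0.9, "PROT": 0.1}
trans_p = {"BG": {"BG": 0.8, "PROT": 0.2},
           "PROT": {"BG": 0.7, "PROT": 0.3}}
emit_p = {"BG": {"the": 0.3, "protein": 0.05},
          "PROT": {"bed1": 0.4, "the": 0.001}}
path = viterbi(["the", "bed1", "protein"], states, start_p, trans_p, emit_p)
```

On this toy model the best path routes "bed1" through the protein state, which is exactly how extraction states identify fields in a sentence.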
Representing Sentences
[Figure: shallow parse of “Our results suggest that Bed1 is found in the ER” — the sentence decomposed into clauses, noun phrases, verb phrases, and a prepositional phrase, with word-level tags (art, adjective, noun, verb, cop, unk) and Bed1 marked as a protein noun]
• we first process sentences by analyzing them with a shallow parser (Sundance, [Riloff et al., 98])
Hierarchical HMMs for IE(Part 1)
[Figure: phrase-level model with START and END states, unlabeled NP-SEGMENT and PREP states, and labeled PROTEIN NP-SEGMENT and LOCATION NP-SEGMENT states]
• [Ray & Craven, IJCAI 01; Skounakis et al., IJCAI 03]
• states have types, emit phrases
• some states have labels (PROTEIN, LOCATION)
• our models have 25 states at this level
Hierarchical HMMs for IE (Part 2)
[Figure: the positive model (START, END, NP-SEGMENT, PREP, and labeled PROTEIN NP-SEGMENT and LOCATION NP-SEGMENT states) alongside a null model containing only START, END, NP-SEGMENT, and PREP states]
positive model
null model
Hierarchical HMMs for IE (Part 3)
[Figure: phrase-level model as above; each phrase-level state (e.g. the LOCATION NP-SEGMENT) expands into a word-level HMM with its own START, BEFORE, BETWEEN, AFTER, and END states, each word-level state emitting words with probabilities such as:]
Pr(the) = 0.0003, Pr(and) = 0.0002, …, Pr(cell) = 0.0001
Hierarchical HMMs
[Figure: phrase-level model with VP-SEGMENT, PP-SEGMENT, PROTEIN NP-SEGMENT, and LOCATION NP-SEGMENT states, and nested word-level states as above]
consider emitting “. . . is found in the ER”:
is found
in
the ER
Extraction with our HMMs
[Figure: positive model (START, END, NP-SEGMENT, PP-SEGMENT, PROTEIN NP-SEGMENT, LOCATION NP-SEGMENT) and null model (START, END, NP-SEGMENT, PP-SEGMENT only)]
• extract a relation instance if
– sentence is more probable under positive model
– Viterbi (most probable) path goes through special extraction states
Representing More Local Context
• we can have the word-level states represent more about the local context of each emission
• partition sentence into overlapping trigrams
“... the/ART Bed1/UNK protein/N is/COP located/V ...”
⟨w-1 p-1, w0 p0, w+1 p+1⟩ — one overlapping window per token position
Representing More Local Context
• states emit trigrams with probability:

t = ⟨w-1, w0, w+1, p-1, p0, p+1⟩
Pr(t) = Pr(w-1) Pr(w0) Pr(w+1) Pr(p-1) Pr(p0) Pr(p+1)

• note the independence assumption above: we compensate for this naïve assumption by using a discriminative training method [Krogh ’94] to learn parameters
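A sketch of the trigram emission under the independence assumption: the probability of emitting a trigram is the product of independent per-word and per-tag factors. The distributions below are toy values, not learned parameters.

```python
# Toy per-word and per-POS-tag emission distributions for one state.
word_p = {"the": 0.05, "bed1": 0.01, "protein": 0.02}
tag_p = {"ART": 0.2, "UNK": 0.1, "N": 0.3}

def emit_trigram(words, tags, word_p, tag_p, smooth=1e-6):
    """Pr(t) as a product of independent word and tag probabilities,
    with a small smoothing value for unseen items."""
    pr = 1.0
    for w in words:
        pr *= word_p.get(w, smooth)
    for p in tags:
        pr *= tag_p.get(p, smooth)
    return pr

pr = emit_trigram(("the", "bed1", "protein"), ("ART", "UNK", "N"),
                  word_p, tag_p)
```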
Experimental Evaluation
• hypothesis: we get more accurate models by using a richer representation of sentence structure in HMMs
• compare predictive accuracy of various types of representations
– hierarchical w/context features
– hierarchical
– phrases
– tokens w/part of speech
– tokens
• 5-fold cross validation on 3 data sets
more grammatical information
Weakly Labeled Data Sets for Learning to Extract Relations
• subcellular_localization(PROTEIN, LOCATION)
– YPD database
– 769 positive, 6193 negative sentences
– 939 tuples (402 distinct)
• disorder_association(GENE, DISEASE)
– OMIM database
– 829 positive, 11685 negative sentences
– 852 tuples (143 distinct)
• protein_protein_interaction(PROTEIN, PROTEIN)
– MIPS database
– 5446 positive, 41377 negative sentences
– 8088 tuples (819 distinct)
Extraction Accuracy (YPD)
[Precision-recall curves for:]
Context HHMMs
HHMMs
Phrase HMMs
POS HMMs
Token HMMs
Extraction Accuracy (MIPS)
[Precision-recall curves for:]
Context HHMMs
HHMMs
Phrase HMMs
POS HMMs
Token HMMs
Extraction Accuracy (OMIM)
[Precision-recall curves for:]
Context HHMMs
HHMMs
Phrase HMMs
POS HMMs
Token HMMs
3. Combining Evidence when Making Predictions
• in processing a large corpus, we are likely to see the same entities and relations in multiple places
• in making extractions, we should combine evidence across the different occurrences/contexts in which we see an entity/relation
Combining Evidence:Organizing Predictions into Bags
CAT is a 64-kD protein…
CAT was established to be…
…were removed from cat brains.
…the cat activated the mouse...
[Table columns: occurrence, predicted, actual]
let n_b be the number of instances in bag b
let p_b be the number of positive predictions
let a_b be the number of actual positives
Combining Evidence when Making Predictions
• given a bag of predictions, estimate the probability that the bag contains at least one actual positive example:

Pr(a_b > 0 | n_b, p_b) =
  [ Σ_{j=1}^{n_b} Pr(p_b | a_b = j, n_b) Pr(a_b = j | n_b) ] / [ Σ_{i=0}^{n_b} Pr(p_b | a_b = i, n_b) Pr(a_b = i | n_b) ]
Combining Evidence:Estimating Relevant Probabilities
Pr(a_b > 0 | n_b, p_b) =
  [ Σ_{j=1}^{n_b} Pr(p_b | a_b = j, n_b) Pr(a_b = j | n_b) ] / [ Σ_{i=0}^{n_b} Pr(p_b | a_b = i, n_b) Pr(a_b = i | n_b) ]

Pr(p_b | a_b, n_b): can model with two binomial distributions based on the estimated TP-rate and FP-rate of the model
Pr(a_b | n_b): can do something simple here (e.g. assume uniform priors), or can estimate this from data with a few assumptions
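A sketch of this estimate, assuming binomial models for Pr(p_b | a_b, n_b) built from the model's TP- and FP-rates and a uniform prior for Pr(a_b | n_b); the specific rates in the usage line are invented.

```python
from math import comb

def pr_positive_bag(n, p, tp_rate, fp_rate):
    """Pr(a > 0 | n, p): probability that a bag of n instances with p
    positive predictions contains at least one actual positive."""
    def pr_p_given_a(a):
        # p predictions split into k true positives (of a actual positives)
        # and p - k false positives (of n - a actual negatives)
        total = 0.0
        for k in range(0, min(a, p) + 1):
            if p - k > n - a:
                continue
            total += (comb(a, k) * tp_rate**k * (1 - tp_rate)**(a - k)
                      * comb(n - a, p - k) * fp_rate**(p - k)
                      * (1 - fp_rate)**(n - a - (p - k)))
        return total

    prior = 1.0 / (n + 1)  # uniform Pr(a | n) over a = 0..n
    num = sum(pr_p_given_a(j) * prior for j in range(1, n + 1))
    den = sum(pr_p_given_a(i) * prior for i in range(0, n + 1))
    return num / den

v1 = pr_positive_bag(4, 3, tp_rate=0.8, fp_rate=0.1)  # 3 of 4 predicted positive
v0 = pr_positive_bag(4, 0, tp_rate=0.8, fp_rate=0.1)  # none predicted positive
```

As expected, a bag with many positive predictions gets a much higher probability of containing a true positive than one with none.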
Evidence Combination: Protein-Protein Interactions
[Precision-recall curves for:]
BECIP
Soft-Count
Noisy-OR
WM
Soft-OR
NC
Evidence Combination: Protein Names
[Precision-recall curves for:]
BECIP
Soft-Count
Noisy-OR
WM
Soft-OR
NC
Conclusions
• machine learning methods provide a means for learning/refining models for information extraction
• learning is inexpensive when unlabeled/weakly labeled sources can be exploited– learning context patterns for protein names– learning HMMs for relation extraction
• we can learn more accurate models by giving HMMs more information about syntactic structure of sentences– hierarchical HMMs
• we can improve the precision of our predictions by carefully combining evidence across extractions
Acknowledgments
my graduate students
Soumya Ray
Burr Settles
Marios Skounakis
NIH/NLM grant 1R01 LM07050-01
NSF CAREER grant IIS-0093016