Automatic template creation for biomedical information extraction: theory and practice
David Corney
UCL Computer Science
2nd May 2006
Motivation
• Biologists need tools to process literature
• This requires knowledge of domain and of computational linguistics
– Option 1: become a computational linguist
– Option 2: collaborate with a computational linguist
– Option 3: use a tool that provides linguistic knowledge
• Aim: to “de-skill” template creation – McLinguist
Scale of the problem
• 672,963 citations added to PubMed in 2005
• c. 91% of these citations are “journal articles”
• Mean length of a journal article: 5600 words
• Over 110 words per second are published in life sciences journals
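The slide does not show how the words-per-second figure was derived. The sketch below is one possible reconstruction, assuming a 365-day year and applying the mean article length either to journal articles only or to all citations; both readings are assumptions rather than the author's stated method.

```python
# Figures taken from the slide.
citations_2005 = 672_963
journal_share = 0.91      # "c. 91%" of citations are journal articles
mean_words = 5_600        # mean length of a journal article

SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # assumption: non-leap year

def words_per_second(n_articles: float) -> float:
    """Average publication rate if n_articles of mean length appear in one year."""
    return n_articles * mean_words / SECONDS_PER_YEAR

journal_only = words_per_second(citations_2005 * journal_share)
all_citations = words_per_second(citations_2005)
print(f"journal articles only: {journal_only:.1f} words/s")
print(f"all citations:         {all_citations:.1f} words/s")
```

Under these assumptions the two readings give roughly 109 and 120 words per second, so the quoted figure is of the right order either way.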
[Figure: citations added to PubMed ('000s) by year, 1999 to 2005]
Background
• BioRAT project
– 2001-?
– Sponsors: GSK (initially) and BBSRC (now)
– Aim: to make biomedical information extraction practical for life sciences researchers
– Uses GATE (ANNIE)
– Software is available as a “permanent prototype”
• Variety of groups using / evaluating system
– http://bioinf.cs.ucl.ac.uk/biorat
Document
Sentence splitter: Swi6 was phosphorylated by Hrr25 kinase...
Token splitter: <Swi6> <was> <phosphorylated> <by> <Hrr25> <kinase>
Part-of-speech tagging: <Swi6 noun> <phosphorylated verb> <by preposition> <Hrr25 noun> <kinase noun>
Semantic tagging (named entity recognition): <Swi6 protein> <phosphorylated interaction> <Hrr25 protein>
Pattern recognition: <protein> <interaction> <preposition> <protein>
Information extraction:
Protein  Protein  Interaction
Swi6     Hrr25    phosphorylated
Database of relevant facts
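The stages on this slide can be imitated in a few lines. The following is only a toy sketch, not BioRAT or GATE code: the gazetteer entries, the function names and the decision to skip part-of-speech tagging and drop untagged tokens are all assumptions made for illustration.

```python
import re

# Toy gazetteer (hypothetical): just enough entries for the slide's sentence.
GAZETTEER = {
    "Swi6": "protein",
    "Hrr25": "protein",
    "phosphorylated": "interaction",
    "by": "preposition",
}

# The pattern shown on the slide: <protein> <interaction> <preposition> <protein>
PATTERN = ["protein", "interaction", "preposition", "protein"]

def tokenise(sentence):
    """Token splitter: keep alphanumeric runs, drop punctuation."""
    return re.findall(r"[A-Za-z0-9]+", sentence)

def tag(tokens):
    """Semantic tagging: keep only the tokens the gazetteer knows about."""
    return [(tok, GAZETTEER[tok]) for tok in tokens if tok in GAZETTEER]

def extract(sentence):
    """Slide the pattern over the tagged tokens and emit (protein, protein, interaction)."""
    tagged = tag(tokenise(sentence))
    facts = []
    for i in range(len(tagged) - len(PATTERN) + 1):
        window = tagged[i:i + len(PATTERN)]
        if [label for _, label in window] == PATTERN:
            facts.append((window[0][0], window[3][0], window[1][0]))
    return facts

print(extract("Swi6 was phosphorylated by Hrr25 kinase..."))
```

A real system would use trained taggers and much larger gazetteers; the point here is only how tagged tokens feed a pattern that fills the columns of the fact table.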
Making template creation easier
• Biologists are not computational linguists
– But have great domain knowledge
• Several approaches being developed
– Template design tool
– Stupid templates + Smart filters = ...?
– A logical framework and its implementation
Option 1: Template authoring tool
• A tool to allow users to create their own templates, without learning a new language
• Need a simple way to define patterns
• Need to give rapid feedback
• Interface needs to be intuitive
Activation of mitogen-activated protein kinase kinase by v-Raf in NIH 3T3 cells and in vitro.
Activation mitogen-activated protein kinase kinase v-Raf
Furthermore, the phosphorylation of p34cdc2 by p107wee1 in vitro inhibited the histone H1 kinase activity of p34cdc2.
phosphorylation p34cdc2 p107wee1
Specific activation of cdc25 tyrosine phosphatases by B-type cyclins: evidence for multiple roles of mitotic cyclins.
activation cdc25 tyrosine
Coexpression of CD5 with p56lck in the baculovirus expression system resulted in the phosphorylation of CD5 on tyrosine residues.
phosphorylation CD5 tyrosine
We also demonstrate that Tyk2, from extracts of either IFN alpha-treated human cells or insect cells infected with the recombinant baculoviruses, can catalyze in vitro phosphorylation of GST-IFN-R protein in a specific manner.
phosphorylation GST IFN-R
Wiskott-Aldrich syndrome protein physically associates with Nck through Src homology 3 domains.
associates Nck Src
Authoring tool: Summary
• Easier than writing templates directly
– But still requires user to learn a new tool
• Constraints on what can be done
• Maybe useful for prototyping
Option 2: Learning filters
• User defines concepts of interest
• Software generates a variety of templates
– Very general and imprecise
• Templates applied to a corpus
• User then marks results as correct or incorrect
• Software then learns ‘filters’ to remove false positives
– May learn to ignore negative findings (“there is no evidence that...”)
– Or learn to focus on the start of the paper
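As a rough illustration of the idea, a filter could be as simple as a set of cue words that appear only in matches the user marked incorrect. The slides do not say which learning method is used, so the functions, the threshold and the example sentences below are hypothetical.

```python
from collections import Counter

def learn_filter(marked, min_count=2):
    """Return words seen in at least min_count incorrect matches and in no correct one."""
    bad, good = Counter(), Counter()
    for sentence, is_correct in marked:
        words = set(sentence.lower().split())
        (good if is_correct else bad).update(words)
    return {w for w, n in bad.items() if n >= min_count and good[w] == 0}

def apply_filter(sentences, cue_words):
    """Drop any sentence that contains a learned cue word."""
    return [s for s in sentences if not cue_words & set(s.lower().split())]

# Hypothetical matches from over-general templates, marked by the user.
marked = [
    ("A binds B", True),
    ("C phosphorylates D", True),
    ("there is no evidence that A binds C", False),
    ("no evidence that E binds F", False),
]
cues = learn_filter(marked)
kept = apply_filter(["G binds H", "no evidence that G binds J"], cues)
print(sorted(cues), kept)
```

With only four marked examples the filter also picks up an uninformative word ("that"), which reflects the slide's point that many examples are needed before the learned filter becomes reliable.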
Filter learning: Summary
• User has less learning to do than with an authoring tool
• Each decision is relatively straightforward
• But requires a lot of effort
– Computer needs many examples before it can successfully learn what is relevant
Moving on
• Three methods for creating templates
– Template design tool
– Stupid templates + smart filters
– A logical framework and its implementation
• Questions?
Option 3: Machine learning via a logical framework
• Fully automatic template creation
• Sentence level:
– User provides set of one or more interesting sentences
– Computer generalises these to a more abstract pattern
• Paper level:
– User provides a set of interesting papers
– Computer creates templates that match information in those papers (and not in an irrelevant corpus)
A Framework for Information Extraction
• Aim: to describe formally the space of all templates for any set of one or more phrases
• We can then use machine learning to search for templates
• Formally define “word”, “attribute”, “template”, “fragment”, “match” etc.
• Describe how to create, modify and evaluate templates
• Possible search algorithms
Information extraction templates
• A template is a pattern of words and their attributes
• It is a list of word-attributes that correspond to sequences of words found in that order
• Good templates match interesting fragments (true positives) with as few irrelevant fragments (false positives) as possible
TEMPLATE          Example matching fragments
THE ANIMAL SAT    the cat sat; the dog sat
DT ANIMAL *       some dogs barked; the bees knees
* * *             to be or; form new government
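The matching rule behind these examples can be sketched as follows. This is a minimal illustration, assuming each template element tests exactly one attribute of one word and each wildcard matches exactly one word; the `Word` class, the element kinds and the hand-assigned tags are hypothetical, not the framework's actual data model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Word:
    literal: str
    pos: str             # part-of-speech tag, e.g. "DT"
    category: str = ""   # gazetteer category, e.g. "ANIMAL"

def element_matches(element, word):
    """Test one template element, given as (kind, value), against one word."""
    kind, value = element
    if kind == "WILD":
        return True
    if kind == "LITERAL":
        return word.literal.lower() == value.lower()
    if kind == "POS":
        return word.pos == value
    if kind == "GAZ":
        return word.category == value
    raise ValueError(f"unknown element kind: {kind}")

def find_matches(template, words):
    """Return every contiguous fragment whose words match the template in order."""
    n = len(template)
    fragments = []
    for i in range(len(words) - n + 1):
        fragment = words[i:i + n]
        if all(element_matches(e, w) for e, w in zip(template, fragment)):
            fragments.append(" ".join(w.literal for w in fragment))
    return fragments

# Hand-tagged example sentences (tags are assumptions for illustration).
cat_sat = [Word("the", "DT"), Word("cat", "NN", "ANIMAL"), Word("sat", "VBD")]
dogs = [Word("some", "DT"), Word("dogs", "NNS", "ANIMAL"), Word("barked", "VBD")]

the_animal_sat = [("LITERAL", "the"), ("GAZ", "ANIMAL"), ("LITERAL", "sat")]
dt_animal_any = [("POS", "DT"), ("GAZ", "ANIMAL"), ("WILD", "*")]

print(find_matches(the_animal_sat, cat_sat))
print(find_matches(dt_animal_any, dogs))
print(find_matches(the_animal_sat, dogs))
```

The more general template `DT ANIMAL *` matches both sentences, while `THE ANIMAL SAT` matches only the first, which is the trade-off between recall and precision that the table illustrates.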
Superset generalisation
• Each generalisation creates a new template that matches everything that its parent matches
– Including the same true-positive and false-positive fragments
• So if a template has too many false positives, then so will all of its descendants
• We can feed knowledge forward as we evaluate templates
– Count the numbers of true positive and false positive matches
– A parent’s counts are a lower bound on its children’s counts
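Because a child matches a superset of what each parent matches, parent counts can be used to rule out a child before it is evaluated. The sketch below assumes a child may have several parents and takes the element-wise maximum of their counts as the bound; the function names and the numbers are hypothetical.

```python
def inherited_bounds(parents_hits):
    """Lower bound on a child's (TP, FP), given the (TP, FP) of each evaluated parent."""
    tp = max((tp for tp, _ in parents_hits), default=0)
    fp = max((fp for _, fp in parents_hits), default=0)
    return tp, fp

def worth_evaluating(parents_hits, max_fp):
    """Skip a child if its parents already guarantee more than max_fp false positives."""
    _, fp_lower_bound = inherited_bounds(parents_hits)
    return fp_lower_bound <= max_fp

# Hypothetical counts for two parents of the same child template.
parents = [(20, 50), (15, 60)]
print(inherited_bounds(parents))
print(worth_evaluating(parents, max_fp=55))
print(worth_evaluating(parents, max_fp=100))
```

The bound is only one-sided: a child can have far more hits than its parents, so passing this check does not guarantee acceptable precision once the child is actually measured.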
Searching for good templates
• Start with a “seed” phrase
– Creates a trivial but precise template
• Create several new templates through generalisation
– Increases recall
– Each template has at least one true positive
• Need to choose which template to generalise next
– Select generalisation that maintains precision
• Compare performance on “relevant” and “irrelevant” corpora
– Assume every match in “relevant” corpus is true positive
– Assume every match in “irrelevant” corpus is false positive
TEMPLATE                             Example matching fragments
THE CAT SAT ON THE MAT               the cat sat on the mat
THE ANIMAL SAT ON THE MAT            the dog sat on the mat; the unicorn sat on the mat
THE ANIMAL VB ON THE MAT             the dog walked on the mat; the elephant danced on the mat
THE CAT SAT ON THE FLOOR_COVERING    the cat sat on the rug; the cat sat on the carpet
DT * * IN DT *                       some fairies danced on a pin; a cloud rose over a hill
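The framework
[Diagram: a corpus contains documents (document1, document2, document3, ...); each document contains words (word1, word2, word3, ...); each word has attributes (word1.literal, word1.stem, word1.pos, ...). Attribute categories: literal, stem, POS, gazetteer, wildcard, parse tree, language, ... A set of templates (Template1, Template2, ...) is made of template elements (TemplateElement 1, TemplateElement 2, ...), each holding attribute-values. Matching templates against the corpus gives a set of results (fragment 1, fragment 2, ...).]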
Implementation
• Aim is to aid real-world applications
– Focus of IE is on biomedical text
– Templates as described are already part of the “BioRAT” system
– Search algorithm implemented but still being evaluated
• Start with seed phrase, and generalise gradually
– Some explorations have been carried out
Example results
• Start with:
– Seed phrase “Rad53p protein binds to Dbf4p”
– Positive corpus: 500 abstracts on protein-protein interactions (from the DIP database)
– Neutral corpus: first 500 abstracts dated September 05
• Results:
Iteration  Measured hits (TP,FP)  Inherited hits (TP,FP)  Template
0          (1,0)                  (0,0)                   [LITERAL: Rad53p] [LITERAL: protein] [LITERAL: binds] [LITERAL: to] [LITERAL: Dbf4p]
10         (2,0)                  (1,0)                   [GAZ: protein, sp_gene] [LITERAL: protein] [GAZ: prot_binding, main] [WILD: *] [GAZ: protein, sp_gene]
20         (86,1)                 (25,1)                  [GAZ: protein] [WILD: ?] [GAZ: prot_binding, main] [WILD: *] [GAZ: protein, sp_gene]
30         (286,4)                (247,4)                 [GAZ: protein] [WILD: ?????] [GAZ: prot_binding, main] [WILD: ???] [GAZ: protein, sp_gene]
40         (1202,34)              (1202,34)               [WILD: ??] [WILD: ?????] [GAZ: prot_binding, main] [WILD: ?????] [GAZ: protein]
50         (2621,151)             (2621,151)              [WILD: ???] [WILD: ?????] [GAZ: prot_binding, main] [WILD: ?????] [WILD: *]
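One way to read the measured hits is as an estimate of precision at each iteration. The calculation below uses only the (TP, FP) pairs from the results above and relies on the working assumption stated earlier in the talk, that every match in the positive corpus is a true positive and every match in the neutral corpus is a false positive. Recall cannot be computed from these figures because the total number of relevant fragments is not given.

```python
# Measured hits (TP, FP) per iteration, copied from the results table.
measured = {
    0: (1, 0),
    10: (2, 0),
    20: (86, 1),
    30: (286, 4),
    40: (1202, 34),
    50: (2621, 151),
}

def precision(tp: int, fp: int) -> float:
    """Fraction of matches that fall in the positive corpus."""
    return tp / (tp + fp)

for iteration, (tp, fp) in measured.items():
    print(f"iteration {iteration:2d}: TP={tp:4d} FP={fp:3d} precision={precision(tp, fp):.3f}")
```

On these numbers the estimated precision stays above 0.98 up to iteration 30 and falls to about 0.95 by iteration 50, while the number of true positives grows by roughly a factor of nine over the same span.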
Example results
• After 30 iterations, we get 286 TP and 4 FP
• [GAZ: protein] [WILD: ?????] [GAZ: prot_binding, main] [WILD: ???] [GAZ: protein, sp_gene]
• Example matches
• Protein kinase C delta associates with and phosphorylates Stat3 in an interleukin-6-dependent manner.
• Furthermore, Stat3 was phosphorylated by PKC delta in vivo on Ser-727...
• ...efficient transcription of yeast AMP biosynthesis genes requires interaction between Bas1p and Bas2p which is promoted...
• ...RII alpha fused to endonexin II formed dimers but did not bind MAP2.
• A protein interaction map for cell polarity development.
Over-generalisation
• Compromise between true-positives and false-positives
– After 50 iterations, we get 2621 TP and 151 FP
• [WILD: ???] [WILD: ?????] [GAZ: prot_binding, main] [WILD: ?????] [WILD: *]
[Figure: log(False positives) plotted against log(True positives)]
Machine learning via framework: Summary
• Requires less effort from the user
– Just providing a few examples to get things going
• But less user input may lead to less reliable results
• May need to be combined with previous methods
• Could also start with several positive examples
– Interactive search
– Negative examples
Measuring document similarity
• Create random templates from one document
• Search for matching fragments in a second document
• Similar documents will have similar number of matches
• Templates will capture semantics as well as word frequencies
– Cf. vectors of word frequencies and TF.IDF
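The idea can be sketched with very simple templates. In the version below a template is a three-word window from the first document with one position replaced by a one-word wildcard; it uses literals only, so it does not capture the semantic attributes the slide has in mind. The function names, the window length and the fixed random seed are assumptions, and documents are assumed to be at least three words long.

```python
import random

def random_templates(words, n_templates, length=3, rng=None):
    """Sample word windows from a document and blank one position as a wildcard."""
    rng = rng or random.Random(0)   # fixed seed so repeated runs agree
    templates = []
    for _ in range(n_templates):
        start = rng.randrange(len(words) - length + 1)
        window = list(words[start:start + length])
        window[rng.randrange(length)] = None   # None acts as a one-word wildcard
        templates.append(tuple(window))
    return templates

def count_matches(template, words):
    """Count the fragments of a document that match the template."""
    n = len(template)
    return sum(
        all(t is None or t == w for t, w in zip(template, words[i:i + n]))
        for i in range(len(words) - n + 1)
    )

def similarity(doc_a, doc_b, n_templates=20):
    """Total matches in doc_b of templates drawn at random from doc_a."""
    a, b = doc_a.lower().split(), doc_b.lower().split()
    return sum(count_matches(t, b) for t in random_templates(a, n_templates))

doc1 = "the cat sat on the mat"
doc2 = "the dog sat on the mat"
doc3 = "protein kinase binds receptor domain quickly"
print(similarity(doc1, doc1), similarity(doc1, doc2), similarity(doc1, doc3))
```

Replacing the literal comparison with attribute tests (stems, part-of-speech tags, gazetteer categories) would bring this closer to the templates described earlier in the talk.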
References
• Corney, D. P. A., Byrne, E.L., Buxton, B. F. and Jones, D. T. (2005) "A Logical Framework for Template Creation and Information Extraction", Foundations of Semantic Oriented Data and Web Mining workshop, part of ICDM2005 (the Fifth IEEE International Conference on Data Mining).
• Corney, D. P. A., Buxton, B. F., Langdon W.B. and Jones, D. T. (2004) "BioRAT: Extracting Biological Information from Full-length Papers", Bioinformatics 2004; 20(17), pp. 3206-13.
• http://www.cs.ucl.ac.uk/staff/d.corney/publications.html
Acknowledgements
• Profs. Bernard Buxton, David Jones (PI)
• Framework with Dr. Emma Byrne
• Funding and support from
– BBSRC, Inpharmatica and GlaxoSmithKline
• BioRAT (Biological Research Assistant for Text Mining)
– http://bioinf.cs.ucl.ac.uk/biorat
– [email protected]
Search algorithm
• Maintain sets of unevaluated and evaluated templates
• Start with seed phrase
• Generalise each term every possible way
• Evaluate “most promising” template
– Find matches in positive and neutral corpora
– Generate each possible child template
• Repeat for next template
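The loop above can be sketched as a best-first search. The slides do not define "most promising", so this version assumes it means fewest inherited false positives, then most inherited true positives. The toy lexicon, the generalisation ladder (literal, gazetteer category, POS tag, wildcard) and all function names are hypothetical.

```python
import heapq

# Hypothetical toy lexicon: word -> (gazetteer category or None, POS tag).
LEXICON = {
    "the": (None, "DT"), "cat": ("ANIMAL", "NN"), "dog": ("ANIMAL", "NN"),
    "sat": (None, "VB"), "ran": (None, "VB"), "car": (None, "NN"),
}

def attributes(word):
    """Attribute values of a word, from most specific to most general."""
    category, pos = LEXICON[word]
    return [v for v in (word, category, pos, "*") if v is not None]

def matches(template, words):
    return all(t in attributes(w) for t, w in zip(template, words))

def count_hits(template, corpus):
    n = len(template)
    return sum(matches(template, s[i:i + n])
               for s in corpus for i in range(len(s) - n + 1))

def children(template, seed):
    """Generalise one element by one step along its seed word's attribute list."""
    for i, (value, word) in enumerate(zip(template, seed)):
        ladder = attributes(word)
        step = ladder.index(value)
        if step + 1 < len(ladder):
            yield template[:i] + (ladder[step + 1],) + template[i + 1:]

def search(seed, positive, neutral, max_fp=0, iterations=20):
    queue = [(0, 0, seed)]   # (inherited FP, -inherited TP, template)
    evaluated = {}
    while queue and len(evaluated) < iterations:
        _, _, template = heapq.heappop(queue)
        if template in evaluated:
            continue
        tp, fp = count_hits(template, positive), count_hits(template, neutral)
        evaluated[template] = (tp, fp)
        if fp > max_fp:
            continue         # descendants can only match more, so prune here
        for child in children(template, seed):
            if child not in evaluated:
                heapq.heappush(queue, (fp, -tp, child))
    acceptable = {t: h for t, h in evaluated.items() if h[1] <= max_fp}
    return max(acceptable, key=lambda t: acceptable[t][0])

seed = ("the", "cat", "sat")
positive = [("the", "cat", "sat"), ("the", "dog", "sat"), ("the", "dog", "ran")]
neutral = [("the", "car", "sat"), ("the", "car", "ran")]
best = search(seed, positive, neutral)
print(best, count_hits(best, positive), count_hits(best, neutral))
```

In this toy setting the search generalises "cat" to the ANIMAL category but stops short of the NN tag, since that would start matching "car" in the neutral corpus.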
Inheritance
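[Diagram: tree of templates, each node labelled with its hit counts]
Node 0: 10+40N
Node 1: 20+50N
Node 2: 15+60N
Node 3: 25+45N
Node 4: 40+55N
Node 5: ?+?N
Node 6: ?+?N
Key: TP+FPN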