Automatic template creation for biomedical information extraction: theory and practice

David Corney

UCL Computer Science

2nd May 2006

Motivation

• Biologists need tools to process literature
• This requires knowledge of domain and of computational linguistics
– Option 1: become a computational linguist
– Option 2: collaborate with a computational linguist
– Option 3: use a tool that provides linguistic knowledge
• Aim: to “de-skill” template creation – McLinguist

Scale of the problem

• 672,963 citations added to PubMed in 2005

• c. 91% of these citations are “journal articles”

• Mean length of a journal article: 5600 words

• Over 110 words per second are published in life sciences journals

[Figure: citations added to PubMed ('000s) by year, 1999 to 2005]

Background

• BioRAT project – 2001-?
– Sponsors: GSK (initially) and BBSRC (now)
– Aim: to make biomedical information extraction practical for life sciences researchers
– Uses GATE (ANNIE)
– Software is available as a “permanent prototype”
• Variety of groups using / evaluating system
– http://bioinf.cs.ucl.ac.uk/biorat

Document

Processing stages:
• Sentence splitter
• Token splitter
• Part-of-speech tagging
• Semantic tagging (named entity recognition)
• Pattern recognition
• Information extraction

Example:
Swi6 was phosphorylated by Hrr25 kinase...
<Swi6> <was> <phosphorylated> <by> <Hrr25> <kinase>
<Swi6 noun> <phosphorylated verb> <by preposition> <Hrr25 noun> <kinase noun>
<protein> <interaction> <preposition> <protein>
<Swi6 protein> <phosphorylated interaction> <Hrr25 protein>

Database of relevant facts:
Protein    Protein    Interaction
Swi6       Hrr25      phosphorylated
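The stages above can be sketched in a few lines of Python. This is only an illustration of the idea, not BioRAT or GATE code: the tiny gazetteers, the regular expressions and every function name are assumptions made for the example, and part-of-speech tagging is skipped for brevity.

```python
import re

# Hypothetical mini-gazetteers, for illustration only (not BioRAT's real lists).
PROTEINS = {"Swi6", "Hrr25"}
INTERACTIONS = {"phosphorylated", "binds", "activates"}
PREPOSITIONS = {"by", "to", "with"}

# The pattern from the slide: <protein> <interaction> <preposition> <protein>
PATTERN = ["protein", "interaction", "preposition", "protein"]


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive token splitter: runs of letters, digits and hyphens."""
    return re.findall(r"[A-Za-z0-9-]+", sentence)


def semantic_tag(token):
    """Look the token up in the gazetteers; anything unknown is 'other'."""
    if token in PROTEINS:
        return "protein"
    if token in INTERACTIONS:
        return "interaction"
    if token in PREPOSITIONS:
        return "preposition"
    return "other"


def extract_interactions(text):
    """Return (protein, protein, interaction) facts found by the pattern."""
    facts = []
    for sentence in split_sentences(text):
        tagged = [(tok, semantic_tag(tok)) for tok in tokenize(sentence)]
        # Keep only semantically tagged tokens, as in the slide's example.
        tagged = [(tok, tag) for tok, tag in tagged if tag != "other"]
        for i in range(len(tagged) - len(PATTERN) + 1):
            window = tagged[i:i + len(PATTERN)]
            if [tag for _, tag in window] == PATTERN:
                facts.append((window[0][0], window[3][0], window[1][0]))
    return facts
```

Calling `extract_interactions("Swi6 was phosphorylated by Hrr25 kinase.")` yields one fact in the same column order as the database row above (protein, protein, interaction).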

Making template creation easier

• Biologists are not computational linguists
– But have great domain knowledge
• Several approaches being developed
– Template design tool
– Stupid templates + Smart filters = ...?
– A logical framework and its implementation

Option 1: Template authoring tool

• A tool to allow users to create their own templates, without learning a new language

• Need a simple way to define patterns
• Need to give rapid feedback
• Interface needs to be intuitive

Activation of mitogen-activated protein kinase kinase by v-Raf in NIH 3T3 cells and in vitro.

Activation mitogen-activated protein kinase kinase v-Raf

Furthermore, the phosphorylation of p34cdc2 by p107wee1 in vitro inhibited the histone H1 kinase activity of p34cdc2.

phosphorylation p34cdc2 p107wee1

Specific activation of cdc25 tyrosine phosphatases by B-type cyclins: evidence for multiple roles of mitotic cyclins.

activation cdc25 tyrosine

Coexpression of CD5 with p56lck in the baculovirus expression system resulted in the phosphorylation of CD5 on tyrosine residues.

phosphorylation CD5 tyrosine

We also demonstrate that Tyk2, from extracts of either IFN alpha-treated human cells or insect cells infected with the recombinant baculoviruses, can catalyze in vitro phosphorylation of GST-IFN-R protein in a specific manner.

phosphorylation GST IFN-R

Wiskott-Aldrich syndrome protein physically associates with Nck through Src homology 3 domains.

associates Nck Src

Authoring tool: Summary

• Easier than writing templates directly
– But still requires user to learn a new tool
• Constraints on what can be done
• Maybe useful for prototyping

Option 2: Learning filters

• User defines concepts of interest
• Software generates a variety of templates
– Very general and imprecise
• Templates applied to a corpus
• User then marks results as correct or incorrect
• Software then learns ‘filters’ to remove false positives
– May learn to ignore negative findings (“there is no evidence that...”)
– Or learn to focus on the start of the paper
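One very simple way to picture a learned filter is as a set of cue words that appear only in results the user marked as incorrect. The sketch below is an assumption for illustration (the slides do not say which learning method is used); the function names and the `min_count` threshold are hypothetical.

```python
from collections import Counter


def learn_filter(labelled, min_count=2):
    """Learn cue words from (sentence, is_correct) pairs marked by the user.

    A word becomes a cue if it occurs in at least `min_count` incorrect
    results and in no correct result.
    """
    good, bad = Counter(), Counter()
    for sentence, is_correct in labelled:
        words = set(sentence.lower().split())
        (good if is_correct else bad).update(words)
    return {w for w, n in bad.items() if n >= min_count and good[w] == 0}


def apply_filter(sentences, cue_words):
    """Drop any candidate sentence that contains a learned cue word."""
    return [s for s in sentences if not (set(s.lower().split()) & cue_words)]


# Toy training data: two correct matches and two negative findings.
labelled = [
    ("Swi6 binds Hrr25", True),
    ("Rad53p binds Dbf4p", True),
    ("there is no evidence that Swi6 binds Cdc28", False),
    ("we found no evidence that Rad53p binds Swi4", False),
]
cues = learn_filter(labelled)
kept = apply_filter(["Tyk2 binds IFN-R", "no evidence that Tyk2 binds Nck"], cues)
```

With only four marked examples the filter picks up the negation phrase, but a realistic filter would need far more examples before it generalised reliably.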

Filter learning: Summary

• User has less learning to do than with an authoring tool
• Each decision is relatively straightforward
• But requires a lot of effort
– Computer needs many examples before it can successfully learn what is relevant

Moving on

• Three methods for creating templates
– Template design tool
– Stupid templates + smart filters
– A logical framework and its implementation

• Questions?

Option 3: Machine learning via a logical framework

• Fully automatic template creation
• Sentence level:
– User provides set of one or more interesting sentences
– Computer generalises these to a more abstract pattern
• Paper level:
– User provides a set of interesting papers
– Computer creates templates that match information in those papers (and not in an irrelevant corpus)

A Framework for Information Extraction

• Aim: to describe formally the space of all templates for any set of one or more phrases

• We can then use machine learning to search for templates
• Formally define “word”, “attribute”, “template”, “fragment”, “match” etc.
• Describe how to create, modify and evaluate templates
• Possible search algorithms

Information extraction templates

• A template is a pattern of words and their attributes
• It is a list of word-attributes that correspond to sequences of words found in that order
• Good templates match interesting fragments (true positives) with as few irrelevant fragments (false positives) as possible

TEMPLATE          Example matching fragments
THE ANIMAL SAT    the cat sat; the dog sat
DT ANIMAL *       some dogs barked; the bees knees
* * *             to be or; form new government
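The templates above can be read as lists of (category, value) elements that are slid along a sentence. The following Python sketch shows one way to match them; the lexicon, the tag names and the function names are assumptions for illustration and do not reflect BioRAT's actual representation.

```python
# Hypothetical lexicon: word -> (part-of-speech tag, gazetteer categories).
LEXICON = {
    "the": ("DT", set()), "some": ("DT", set()),
    "cat": ("NN", {"ANIMAL"}), "dog": ("NN", {"ANIMAL"}),
    "dogs": ("NNS", {"ANIMAL"}), "bees": ("NNS", {"ANIMAL"}),
    "sat": ("VBD", set()), "barked": ("VBD", set()),
    "knees": ("NNS", set()),
}


def annotate(text):
    """Attach literal, part-of-speech and gazetteer attributes to each word."""
    words = []
    for literal in text.lower().split():
        pos, gaz = LEXICON.get(literal, ("UNK", set()))
        words.append({"literal": literal, "pos": pos, "gaz": gaz})
    return words


def element_matches(element, word):
    """Test one template element against one annotated word."""
    kind, value = element
    if kind == "WILD":
        return True
    if kind == "LITERAL":
        return word["literal"] == value
    if kind == "POS":
        return word["pos"] == value
    if kind == "GAZ":
        return value in word["gaz"]
    raise ValueError(f"unknown element kind: {kind}")


def find_matches(template, text):
    """Return every fragment of `text` matched by the template, in order."""
    words = annotate(text)
    n = len(template)
    fragments = []
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        if all(element_matches(e, w) for e, w in zip(template, window)):
            fragments.append(" ".join(w["literal"] for w in window))
    return fragments


THE_ANIMAL_SAT = [("LITERAL", "the"), ("GAZ", "ANIMAL"), ("LITERAL", "sat")]
DT_ANIMAL_WILD = [("POS", "DT"), ("GAZ", "ANIMAL"), ("WILD", None)]
```

For example, `find_matches(THE_ANIMAL_SAT, "the cat sat and the dog sat")` returns both fragments, while `DT_ANIMAL_WILD` also accepts "some dogs barked" and "the bees knees".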

Superset generalisation

• Each generalisation creates a new template that matches everything that its parent matches
– Including the same true-positive and false-positive fragments
• So if a template has too many false positives, then so will all of its descendants
• We can feed knowledge forward as we evaluate templates
– Count the numbers of true positive and false positive matches
– Parents’ performance defines a lower bound on children’s performance
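The superset property can be written out explicitly. The notation below is introduced here for illustration and is not taken from the slides: $M(T)$ is the set of fragments matched by template $T$, $R$ is the set of relevant fragments, and $\mathrm{Par}(T')$ is the set of already evaluated parents of $T'$.

```latex
% T' is a generalisation (child) of T
M(T) \subseteq M(T')

% Hit counts
\mathrm{TP}(T) = |M(T) \cap R|, \qquad \mathrm{FP}(T) = |M(T) \setminus R|

% Both counts can only stay the same or grow under generalisation
\mathrm{TP}(T') \ge \mathrm{TP}(T), \qquad \mathrm{FP}(T') \ge \mathrm{FP}(T)

% With several evaluated parents, the tightest inherited bounds are
\mathrm{TP}(T') \ge \max_{T \in \mathrm{Par}(T')} \mathrm{TP}(T), \qquad
\mathrm{FP}(T') \ge \max_{T \in \mathrm{Par}(T')} \mathrm{FP}(T)
```

These inherited bounds are available before the child is matched against any corpus, which is what allows a template with too many false positives to rule out all of its descendants.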

Searching for good templates

• Start with a “seed” phrase
– Creates a trivial but precise template
• Create several new templates through generalisation
– Increases recall
– Each template has at least one true positive
• Need to choose which template to generalise next
– Select generalisation that maintains precision
• Compare performance on “relevant” and “irrelevant” corpora
– Assume every match in “relevant” corpus is true positive
– Assume every match in “irrelevant” corpus is false positive

[Diagram: successive generalisations of a seed template]

TEMPLATE                             Example matching fragments
THE CAT SAT ON THE MAT               the cat sat on the mat
THE ANIMAL SAT ON THE MAT            the dog sat on the mat; the unicorn sat on the mat
THE ANIMAL VB ON THE MAT             the dog walked on the mat; the elephant danced on the mat
THE CAT SAT ON THE FLOOR_COVERING    the cat sat on the rug; the cat sat on the carpet
DT * * IN DT *                       some fairies danced on a pin; a cloud rose over a hill
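A generalisation step like the ones in this example can be sketched as replacing one element of a template with a broader one. The hierarchy used here (literal to gazetteer category or part-of-speech tag, then to wildcard), the two small dictionaries and the function names are all assumptions for illustration.

```python
# Hypothetical lookup tables for the running example.
GAZETTEER = {"cat": "ANIMAL", "dog": "ANIMAL",
             "mat": "FLOOR_COVERING", "rug": "FLOOR_COVERING"}
POS_TAGS = {"the": "DT", "cat": "NN", "sat": "VB", "on": "IN", "mat": "NN"}


def seed_template(phrase):
    """Turn a seed phrase into a template made only of literals."""
    return [("LITERAL", w) for w in phrase.lower().split()]


def generalise_element(element):
    """Return the broader elements that could replace this one."""
    kind, value = element
    if kind == "LITERAL":
        options = []
        if value in GAZETTEER:
            options.append(("GAZ", GAZETTEER[value]))
        if value in POS_TAGS:
            options.append(("POS", POS_TAGS[value]))
        return options
    if kind in ("GAZ", "POS"):
        return [("WILD", "*")]
    return []  # a wildcard cannot be generalised further


def children(template):
    """Every template obtained by generalising exactly one element."""
    result = []
    for i, element in enumerate(template):
        for broader in generalise_element(element):
            result.append(template[:i] + [broader] + template[i + 1:])
    return result


seed = seed_template("the cat sat on the mat")
kids = children(seed)
```

Here `kids` contains, among others, the templates written above as THE ANIMAL SAT ON THE MAT and THE CAT SAT ON THE FLOOR_COVERING; repeated application eventually reaches all-wildcard templates.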

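The framework

[Diagram: components of the framework]
• Categories: Literal, Stem, POS, Gazetteer, Wildcard, Parse tree, Language, ...
• Corpus: document1, document2, document3, ...
– Each document contains words: word1, word2, word3, ...
– Each word has attributes: word1.literal, word1.stem, word1.pos, ...
• Set of templates: Template1, Template2, ...
– Each template contains TemplateElement 1, TemplateElement 2, ...
– Each element holds attribute-value 1, attribute-value 2, attribute-value 3, ...
• Match: templates applied to the corpus give a set of results (fragment 1, fragment 2, ...)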

Implementation

• Aim is to aid real-world applications
– Focus of IE is on biomedical text
– Templates as described are already part of the “BioRAT” system
– Search algorithm implemented but still being evaluated
• Start with seed phrase, and generalise gradually
– Some explorations have been carried out

Example results

• Start with:
– Seed phrase “Rad53p protein binds to Dbf4p”
– Positive corpus: 500 abstracts on protein-protein interactions (from the DIP database)
– Neutral corpus: first 500 abstracts dated September 05

• Results:

Iteration   Measured hits (TP,FP)   Inherited hits (TP,FP)   Template
0           (1,0)                   (0,0)                    [LITERAL: Rad53p] [LITERAL: protein] [LITERAL: binds] [LITERAL: to] [LITERAL: Dbf4p]
10          (2,0)                   (1,0)                    [GAZ: protein, sp_gene] [LITERAL: protein] [GAZ: prot_binding, main] [WILD: *] [GAZ: protein, sp_gene]
20          (86,1)                  (25,1)                   [GAZ: protein] [WILD: ?] [GAZ: prot_binding, main] [WILD: *] [GAZ: protein, sp_gene]
30          (286,4)                 (247,4)                  [GAZ: protein] [WILD: ?????] [GAZ: prot_binding, main] [WILD: ???] [GAZ: protein, sp_gene]
40          (1202,34)               (1202,34)                [WILD: ??] [WILD: ?????] [GAZ: prot_binding, main] [WILD: ?????] [GAZ: protein]
50          (2621,151)              (2621,151)               [WILD: ???] [WILD: ?????] [GAZ: prot_binding, main] [WILD: ?????] [WILD: *]
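To make the trade-off in these hit counts easier to see, the measured (TP,FP) pairs can be turned into an estimated precision, TP / (TP + FP). This relies on the slides' own assumption that every match in the positive corpus is a true positive and every match in the neutral corpus is a false positive, so the figures are estimates only; the function names are hypothetical.

```python
# (iteration, TP, FP) measured hits copied from the results table.
MEASURED = [
    (0, 1, 0),
    (10, 2, 0),
    (20, 86, 1),
    (30, 286, 4),
    (40, 1202, 34),
    (50, 2621, 151),
]


def precision(tp, fp):
    """Estimated precision; defined as 0.0 when there are no matches at all."""
    total = tp + fp
    return tp / total if total else 0.0


def summarise(rows):
    """Map each iteration number to its estimated precision."""
    return {iteration: precision(tp, fp) for iteration, tp, fp in rows}


for iteration, p in summarise(MEASURED).items():
    print(f"iteration {iteration:2d}: estimated precision {p:.3f}")
```

The number of matches grows by three orders of magnitude between iteration 0 and iteration 50, while the estimated precision falls only gradually until the later, heavily wildcarded templates.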

Example results

• After 30 iterations, we get 286 TP and 4 FP
• [GAZ: protein] [WILD: ?????] [GAZ: prot_binding, main] [WILD: ???] [GAZ: protein, sp_gene]
• Example matches
• Protein kinase C delta associates with and phosphorylates Stat3 in an interleukin-6-dependent manner.
• Furthermore, Stat3 was phosphorylated by PKC delta in vivo on Ser-727...
• ...efficient transcription of yeast AMP biosynthesis genes requires interaction between Bas1p and Bas2p which is promoted...
• ...RII alpha fused to endonexin II formed dimers but did not bind MAP2.
• A protein interaction map for cell polarity development.

Over-generalisation

• Compromise between true-positives and false-positives
– After 50 iterations, we get 2621 TP and 151 FP
• [WILD: ???] [WILD: ?????] [GAZ: prot_binding, main] [WILD: ?????] [WILD: *]

[Figure: log(False positives) plotted against log(True positives)]

Machine learning via framework: Summary

• Requires less effort from the user
– Just providing a few examples to get things going
• But less user input may lead to less reliable results
• May need to be combined with previous methods
• Could also start with several positive examples
– Interactive search
– Negative examples

Measuring document similarity

• Create random templates from one document
• Search for matching fragments in a second document
• Similar documents will have similar numbers of matches
• Templates will capture semantics as well as word frequencies
– Cf. vectors of word frequencies and TF.IDF
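A rough sketch of this idea follows. It uses only literals and wildcards, so it does not capture the semantic attributes mentioned above; the template length, the wildcard probability, the rule of keeping at least one literal per template and all function names are assumptions made for the example.

```python
import random


def random_templates(words, n_templates=20, length=3, wild_prob=0.5, rng=None):
    """Sample word windows from a document and wildcard some positions.

    None stands for a wildcard. One randomly chosen position is always kept
    as a literal so that no template matches everything.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch repeatable
    templates = []
    for _ in range(n_templates):
        start = rng.randrange(len(words) - length + 1)
        window = words[start:start + length]
        keep = rng.randrange(length)
        templates.append(tuple(
            w if i == keep or rng.random() >= wild_prob else None
            for i, w in enumerate(window)
        ))
    return templates


def count_matches(template, words):
    """Number of positions in `words` where the template matches."""
    n = len(template)
    return sum(
        all(t is None or t == w for t, w in zip(template, words[i:i + n]))
        for i in range(len(words) - n + 1)
    )


def similarity(doc_a, doc_b, **kwargs):
    """Total matches in doc_b of random templates drawn from doc_a."""
    a, b = doc_a.lower().split(), doc_b.lower().split()
    return sum(count_matches(t, b) for t in random_templates(a, **kwargs))


doc = "swi6 binds hrr25 and rad53p binds dbf4p in yeast"
same = similarity(doc, doc)
other = similarity(doc, "the cat sat on the mat")
```

Because every template is drawn from the first document, that document always matches itself at least once per template, whereas a document with no words in common scores zero.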

References

• Corney, D. P. A., Byrne, E. L., Buxton, B. F. and Jones, D. T. (2005) "A Logical Framework for Template Creation and Information Extraction", Foundations of Semantic Oriented Data and Web Mining workshop, part of ICDM2005 (the Fifth IEEE International Conference on Data Mining).
• Corney, D. P. A., Buxton, B. F., Langdon, W. B. and Jones, D. T. (2004) "BioRAT: Extracting Biological Information from Full-length Papers", Bioinformatics 20(17), pp. 3206-13.
• http://www.cs.ucl.ac.uk/staff/d.corney/publications.html

Acknowledgements

• Profs. Bernard Buxton, David Jones (PI)
• Framework with Dr. Emma Byrne
• Funding and support from
– BBSRC, Inpharmatica and GlaxoSmithKline
• BioRAT (Biological Research Assistant for Text Mining)
– http://bioinf.cs.ucl.ac.uk/biorat
– [email protected]

Search algorithm

• Maintain sets of unevaluated and evaluated templates
• Start with seed phrase
• Generalise each term every possible way
• Evaluate “most promising” template
– Find matches in positive and neutral corpora
– Generate each possible child template
• Repeat for next template
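The loop above resembles a best-first search, which could be sketched as follows. The slides do not define “most promising”, so the score used here (the parent's TP minus a penalty times its FP) is an assumption, as are the literal-or-wildcard templates, the `fp_penalty` parameter and all function names.

```python
import heapq


def count_matches(template, words):
    """Number of positions where the template ('*' is a wildcard) matches."""
    n = len(template)
    return sum(
        all(t == "*" or t == w for t, w in zip(template, words[i:i + n]))
        for i in range(len(words) - n + 1)
    )


def evaluate(template, positive, neutral):
    """Matches in the positive corpus count as TP, in the neutral corpus as FP."""
    tp = sum(count_matches(template, doc) for doc in positive)
    fp = sum(count_matches(template, doc) for doc in neutral)
    return tp, fp


def children(template):
    """Generalise each non-wildcard term to a wildcard, one term at a time."""
    return [template[:i] + ("*",) + template[i + 1:]
            for i, t in enumerate(template) if t != "*"]


def search(seed, positive, neutral, iterations=10, fp_penalty=5):
    """Best-first search; returns {template: (TP, FP)} for evaluated templates."""
    evaluated = {}
    frontier = [(0, seed)]  # unevaluated templates: (negated inherited score, template)
    queued = {seed}
    for _ in range(iterations):
        if not frontier:
            break
        _, template = heapq.heappop(frontier)  # most promising first
        tp, fp = evaluate(template, positive, neutral)
        evaluated[template] = (tp, fp)
        score = tp - fp_penalty * fp  # children inherit the parent's score
        for child in children(template):
            if child not in queued:
                queued.add(child)
                heapq.heappush(frontier, (-score, child))
    return evaluated


positive = [["rad53p", "binds", "dbf4p"], ["swi6", "binds", "hrr25"]]
neutral = [["the", "cat", "binds", "books"]]
results = search(("rad53p", "binds", "dbf4p"), positive, neutral, iterations=20)
```

On this toy corpus the seed scores (1,0), while the fully wildcarded template picks up every three-word window in both corpora, illustrating how over-generalisation trades precision for recall.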

Inheritance

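[Diagram: graph of templates 0 to 6, each labelled with its hits]
Template 0: 10+40N
Template 1: 20+50N
Template 2: 15+60N
Template 3: 25+45N
Template 4: 40+55N
Template 5: ?+?N
Template 6: ?+?N
Key: TP+FPN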
