![Page 1: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/1.jpg)
Karin Verspoor, Ph.D.
Faculty, Computational Bioscience Program
University of Colorado School of Medicine
http://compbio.ucdenver.edu/Hunter_lab/Verspoor
Research in the Verspoor Lab
![Page 2: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/2.jpg)
Text Mining
•Information extraction from the biomedical literature
–Entity recognition and normalization
–Relation and event extraction
•Last time, I promised that we would look at:
–Ontologies as constraints for information extraction
![Page 3: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/3.jpg)
Making BioNLP relevant
•Recognition of OBO terms, relations
•CRAFT corpus (first release later this year)
![Page 4: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/4.jpg)
OpenDMAP extracts typed relations from the literature
•Concept recognition tool
– Connect ontological terms to literature instances
– Built on Protégé knowledge representation system
•Language patterns associated with concepts and slots
– Patterns can contain text literals, other concepts, constraints (conceptual or syntactic), ordering information, or outputs of other processing.
– Linked to many text analysis engines via UIMA
•Best performance in BioCreative II IPS task
•>500,000 instances of three predicates (with arguments) extracted from Medline abstracts
•[Hunter, et al., 2008] http://bionlp.sourceforge.net
![Page 5: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/5.jpg)
OpenDMAP
[Figure: OpenDMAP overview. Labels: ontology, patterns, free text, extracted information]
![Page 6: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/6.jpg)
OpenDMAP
Free text: Cyclin E2 interacts with Cdk2 in a functional kinase complex.
Ontology: <ontology>
Pattern: Protein protein interaction := [int1] interacts with [int2]
Extracted information: protein protein interaction: interactor1: cyclin E2 interactor2: cdk2
![Page 7: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/7.jpg)
OpenDMAP
PROTÉGÉ ONTOLOGY
CLASS: protein protein interaction
SLOT: interactor1 TYPE: molecule
SLOT: interactor2 TYPE: molecule
PATTERNS
{c-interact} := [interactor1] interacts with [interactor2]
{c-interact} := [interactor1] is bound by [interactor2]
…
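To make the link between an ontology class, its slots, and a language pattern concrete, here is a minimal Python sketch. It is not OpenDMAP's implementation: the helper names `compile_pattern` and `extract`, the translation of each `[slot]` into a regular-expression group, and the word-token stand-in for entity recognition are all assumptions made for illustration, and the slot type constraint (TYPE: molecule) is not checked.

```python
import re

# Hypothetical stand-in for an entity recognizer: one or more word tokens,
# matched lazily so that a slot takes as few tokens as the pattern allows.
SLOT = r"(?P<{name}>\w[\w-]*(?:\s+\w[\w-]*)*?)"


def compile_pattern(pattern):
    """Translate 'literal [slot] literal ...' into a compiled regex."""
    pieces = []
    for token in pattern.split():
        if token.startswith("[") and token.endswith("]"):
            pieces.append(SLOT.format(name=token[1:-1]))  # slot
        else:
            pieces.append(re.escape(token))               # text literal
    return re.compile(r"\s+".join(pieces) + r"\b", re.IGNORECASE)


def extract(frame_class, patterns, text):
    """Return the frame built from the first matching pattern, or None."""
    for pattern in patterns:
        match = compile_pattern(pattern).search(text)
        if match:
            return {"class": frame_class, **match.groupdict()}
    return None


PATTERNS = [
    "[interactor1] interacts with [interactor2]",
    "[interactor1] is bound by [interactor2]",
]

frame = extract(
    "protein protein interaction",
    PATTERNS,
    "Cyclin E2 interacts with Cdk2 in a functional kinase complex.",
)
print(frame)
# {'class': 'protein protein interaction', 'interactor1': 'Cyclin E2', 'interactor2': 'Cdk2'}
```

In a real system the slot fillers would come from concept recognition constrained by the ontology rather than from a token regex, which is what keeps patterns like these from over-matching.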
![Page 8: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/8.jpg)
![Page 9: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/9.jpg)
![Page 10: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/10.jpg)
![Page 11: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/11.jpg)
![Page 12: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/12.jpg)
BioCreative II Example
• Some BioCreative patterns for interact
{c-interact} := [interactor1] {w-is} {w-interact-verb1} {w-preposition} the? [interactor2];
{w-is} := is, are, was, were;
{w-interact-verb1} := co-immunoprecipitate, co-immunoprecipitates, co-immunoprecipitated, co-localize, co-localizes, co-localized;
{w-preposition} := among, between, by, of, with, to;
• Matched text:
PMID 16494873, SENT_ID 16494873_114
Upon precipitation of the SOX10 protein with anti-HA antibody, Western blot detection revealed expression of UBC9-V5 (25 kDa) in the sample (Fig. 1, line 6), indicating that {UBC9 was co-immunoprecipitated with SOX10}.
INTERACTOR_1: UBC9 resolved to UniProtID: UBC9_RAT
INTERACTOR_2: SOX10 resolved to UniProtID: SOX10_RAT
{c-interact} := [UBC9_RAT]interactor_1, [SOX10_RAT]interactor_2
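The word-class definitions above can be read as macros that expand into alternations. The following Python sketch shows one way such an expansion could work on the matched sentence. It is an illustration under stated assumptions, not the OpenDMAP pattern compiler: `token_to_regex`, `compile_pattern`, and the one-token `ENTITY` placeholder are hypothetical, and the resolution of mentions to UniProt identifiers is left out.

```python
import re

WORD_CLASSES = {
    "w-is": ["is", "are", "was", "were"],
    "w-interact-verb1": [
        "co-immunoprecipitate", "co-immunoprecipitates", "co-immunoprecipitated",
        "co-localize", "co-localizes", "co-localized",
    ],
    "w-preposition": ["among", "between", "by", "of", "with", "to"],
}

# Hypothetical placeholder for a recognized protein mention (a single token).
ENTITY = r"[A-Za-z0-9-]+"


def token_to_regex(token):
    """Map one pattern token to a regex fragment (whitespace not included)."""
    if token.startswith("[") and token.endswith("]"):   # slot
        return f"(?P<{token[1:-1]}>{ENTITY})"
    if token.startswith("{") and token.endswith("}"):   # word class
        words = WORD_CLASSES[token[1:-1]]
        return "(?:" + "|".join(re.escape(w) for w in words) + ")"
    return re.escape(token)                              # text literal


def compile_pattern(pattern):
    regex = ""
    for token in pattern.split():
        if token.endswith("?"):                          # optional literal, e.g. the?
            regex += f"(?:{re.escape(token[:-1])}\\s+)?"
        else:
            regex += token_to_regex(token) + r"\s+"
    if regex.endswith(r"\s+"):                           # no whitespace after last token
        regex = regex[: -len(r"\s+")]
    return re.compile(r"\b" + regex + r"\b")


C_INTERACT = compile_pattern(
    "[interactor1] {w-is} {w-interact-verb1} {w-preposition} the? [interactor2]"
)

sentence = (
    "Upon precipitation of the SOX10 protein with anti-HA antibody, Western blot "
    "detection revealed expression of UBC9-V5 (25 kDa) in the sample (Fig. 1, "
    "line 6), indicating that UBC9 was co-immunoprecipitated with SOX10."
)
match = C_INTERACT.search(sentence)
frame = match.groupdict() if match else None
print(frame)  # {'interactor1': 'UBC9', 'interactor2': 'SOX10'}
```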
![Page 13: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/13.jpg)
BioCreative Results
•359 full-text articles in the test set
•385 interaction assertions produced
•Performance averaged per article (to avoid dominance of a few assertion-heavy articles)
P = 0.39, R = 0.31, F = 0.29
•Best result in the evaluation!
–F score 10% higher than next-scoring system
–F score > 3 standard deviations above mean
–Recall 20% higher than next-scoring system
![Page 14: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/14.jpg)
BioCreative conclusions
•Information extraction in biomedical text is hard
– Linguistic variability in how concepts are expressed
– Complex concepts with multiple “slots”
•OpenDMAP advances the state of the art
– Use of an ontology grounds the search for information
– Flexibility of the pattern language to incorporate constraints at different levels (conceptual, lexical, word order, linguistic)
![Page 15: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/15.jpg)
BioNLP’09: Methods
Protein_transport := [TRANSPORTED-ENTITY] translocation @(from {DET}? [TRANSPORT-ORIGIN]) @(to {DET}? [TRANSPORT-DESTINATION])
Bax translocation to mitochondria from the cytosol
Bax translocation from the cytosol to the mitochondria
Slide credit: Kevin B. Cohen
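The two example sentences differ only in the order of the "from" and "to" phrases. Assuming that `@(...)` marks constituents that may appear in either order (which is what the examples suggest), a minimal Python sketch of the idea is to try one regular expression per ordering. The group names are lower-cased versions of the slot names because Python group names cannot contain hyphens; `DET`, `ENTITY`, and `match_order_free` are hypothetical stand-ins, not part of OpenDMAP.

```python
import re
from itertools import permutations

DET = r"(?:the\s+|a\s+|an\s+)?"   # assumed contents of {DET}; always optional here
ENTITY = r"[A-Za-z0-9-]+"         # hypothetical one-token entity mention

HEAD = rf"(?P<transported_entity>{ENTITY})\s+translocation"
FLOATING = [
    rf"from\s+{DET}(?P<transport_origin>{ENTITY})",
    rf"to\s+{DET}(?P<transport_destination>{ENTITY})",
]


def match_order_free(text):
    """Try every ordering of the floating constituents after the head."""
    for order in permutations(FLOATING):
        regex = re.compile(r"\s+".join((HEAD, *order)))
        match = regex.search(text)
        if match:
            return match.groupdict()
    return None


first = match_order_free("Bax translocation to mitochondria from the cytosol")
second = match_order_free("Bax translocation from the cytosol to the mitochondria")
print(first)
print(second)
```

Both sentences yield the same frame, so a single pattern covers both word orders without being written twice.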
![Page 16: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/16.jpg)
BioNLP’09: Methods
Protein_transport := [TRANSPORTED-ENTITY] translocation @(from {DET}? [TRANSPORT-ORIGIN]) @(to {DET}? [TRANSPORT-DESTINATION])
Protein (Sequence Ontology)
Cellular Component (Gene Ontology)
Slide credit: Kevin B. Cohen
![Page 17: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/17.jpg)
BioNLP’09: Methods
Slide credit: Kevin B. Cohen
![Page 18: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/18.jpg)
BioNLP’09: Methods
• All event types represented as frames
– Elements from ontology constrain every slot
EVENT TYPE: REGULATION
AtLoc: instance of biological_entity
Cause: instance of protein
CSite: instance of biological_concept or polypeptide_region
Event_action: instance of trigger_word or detection_method
Site: instance of biological_concept or polypeptide_region
Theme: instance of protein or biological_process
ToLoc: instance of biological_entity
Sequence Ontology
Molecular Interaction Ontology
Gene Ontology
Cell Cycle Ontology
Slide credit: Kevin B. Cohen
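One way to picture "elements from ontology constrain every slot" is a frame whose slots each list the concept types they accept, plus an is-a check against the ontology. The Python sketch below uses the REGULATION slot constraints from the slide; the `PARENT` links and the function names are invented for illustration and do not reflect the structure of the ontologies actually used.

```python
# Slot constraints for the REGULATION frame, as listed on the slide.
REGULATION = {
    "AtLoc": {"biological_entity"},
    "Cause": {"protein"},
    "CSite": {"biological_concept", "polypeptide_region"},
    "Event_action": {"trigger_word", "detection_method"},
    "Site": {"biological_concept", "polypeptide_region"},
    "Theme": {"protein", "biological_process"},
    "ToLoc": {"biological_entity"},
}

# Hypothetical is-a links, invented for this example only.
PARENT = {
    "kinase": "protein",
    "nucleus": "cellular_component",
    "cellular_component": "biological_entity",
}


def ancestors(concept):
    """Yield the concept itself and every ancestor reachable through PARENT."""
    while concept is not None:
        yield concept
        concept = PARENT.get(concept)


def can_fill(frame, slot, concept):
    """True if the concept or one of its ancestors is allowed in the slot."""
    return any(a in frame[slot] for a in ancestors(concept))


print(can_fill(REGULATION, "Cause", "kinase"))   # True: kinase is-a protein
print(can_fill(REGULATION, "Cause", "nucleus"))  # False: not a protein
print(can_fill(REGULATION, "AtLoc", "nucleus"))  # True: via cellular_component
```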
![Page 19: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/19.jpg)
BioNLP’09: Methods
Partial view of ontology—reality is a little bit less clean
Slide credit: Kevin B. Cohen
![Page 20: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/20.jpg)
BioNLP’09: Methods
Event type Site AtLoc ToLoc
Binding protein domain (SO), binding site (SO), DNA (SO), chromosome (SO)
Gene expression gene (SO), biological entity (CCO)
tissue (BTO), cell type (CTO), cellular component (GO)
Localization cellular component (GO)
cellular component (GO)
Phosphorylation amino acid (FMA), polypeptide region (SO)
Protein catabolism cellular component (GO)
Transcription gene (SO), biological entity (CCO)
BTO: BRENDA Tissue Ontology
CCO: Cell Cycle Ontology
CTO: Cell Type Ontology
GO: Gene Ontology
SO: Sequence Ontology
Slide credit: Kevin B. Cohen
![Page 21: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/21.jpg)
BioNLP’09: Methods
•Manual pattern-writing
– Before availability of training data: based on native speaker intuitions, examples from PubMed, and variations on same, as in Cohen et al. (2004)
– After release of training data: based on examination of corpus data, targeting high-frequency predicates only
– Nominalizations predominated; used insights from Cohen et al. (2008) regarding Theme placement
– Protein binding rules re-used from BioCreative II protein-protein interaction task
– Eschewed use of wildcards
Slide credit: Kevin B. Cohen
![Page 22: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/22.jpg)
BioNLP’09: Results

| Task | Our system (P / R / F) | Best team (P / R / F) | Best P/R/F (P / R / F) |
|---|---|---|---|
| Task 1 | 71.81 / 13.45 / 22.66 | 58.48 / 46.73 / 51.95 | 71.81 / 46.73 / 51.95 |
| Task 2 | 70.97 / 13.25 / 43.12 | 54.08 / 35.86 / 43.12 | 70.97 / 35.86 / 43.12 |
| Task 3 | 57.40 / 12.33 / 20.30 | 60.83 / 32.68 / 42.52 | 60.83 / 32.68 / 42.52 |

Task 1: P 10 points higher than second-highest
Task 2: P 14 points higher than second-highest
Task 3: P 3.4 points lower than highest (3/6)
Slide credit: Kevin B. Cohen
![Page 23: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/23.jpg)
BioNLP’09: Results
Unofficial results: contribution of bug repairs

| | P | R | F |
|---|---|---|---|
| Official results | 71.81 | 13.45 | 22.66 |
| With bug fixes | 67.19 | 17.38 | 27.62 |

Still the highest precision (#2 was 62.21)
Slide credit: Kevin B. Cohen
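As a quick check on how P, R, and F relate in this table, the sketch below recomputes F from the reported precision and recall, assuming F is the balanced F-measure (the harmonic mean of P and R). The function name `f_measure` is ours.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


official = round(f_measure(71.81, 13.45), 2)
bug_fixed = round(f_measure(67.19, 17.38), 2)
print(official)   # 22.66
print(bug_fixed)  # 27.62
```

The second value agrees with the bug-fixed F of 27.62 reported for Task 1 on the following slide.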
![Page 24: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/24.jpg)
BioNLP’09: Results
•Contribution of coördination-handling
–Bug-fixed results: F 27.62 (Task 1)
–Without coördination-handling: F 24.72
–Decrease in F of 2.9 without coördination-handling
Slide credit: Kevin B. Cohen
![Page 25: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/25.jpg)
Syntax helps
• 125I-labeled C3b was covalently deposited on CR2, when hemolytically active 125I-labeled C3 was added to Raji cells preincubated with iC3, factor B, properdin, and factor D, thus proving functionality of CR2-bound C3 convertase.
<cr2> BINDS <c3 convertase>
• CD8alpha(alpha) binds one HLA-A2/peptide molecule, interfacing with the alpha2 and alpha3 domains of HLA-A2 and also contacting beta2-microglobulin.
<cd8alpha ( alpha )> BINDS <hla a2 / peptide molecule>
• The binding of 109Cd to metallothionein and the thiol density of the protein were determined after incubation of a purified Zn/Cd-metallothionein preparation with either hydrogen peroxide alone, or with a number of free radical generating systems.
<109cd> BINDS <metallothionein>
• Although these shifts in alpha3 may provide a synergistic modulation of affinity, the binding of CD8 to MHC is clearly consistent with an avidity-based contribution from CD8 to TCR-peptide-MHC interactions.
<Cd8> BINDS <major histocompatibility complex>
![Page 26: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/26.jpg)
More complex examples
•Complex noun phrases
• The inactive C3 (iC3), which forms spontaneously in serum in low amounts by reaction of native C3 with H2O, binds noncovalently to the N-terminal part of CR2.
<inactive c3> BINDS <cr2>
• RelB binds transcriptionally active kappaB motifs in the TNF-alpha promoter in normal cells, and in vitro studies with macrophages isolated from RelB-deficient animals revealed impaired production of TNF-alpha in response to LPS and IFN-gamma.
<relb> BINDS <tnf - alpha promoter>
•Negation
• TNP-BSA, however, did not bind to the CD4 receptor.
<trinitrophenyl-bovine serum albumin> DOES_NOT_BIND <cd4 receptor>
• Similarly, when cells expressing the wild type FSHR were treated with tunicamycin to prevent N-linked glycosylation, the resulting nonglycosylated FSHR was not able to bind FSH.
<resulting nonglycosylated fsh receptor> DOES_NOT_BIND <follicle-stimulating hormone>
![Page 27: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/27.jpg)
Coordination is particularly hard
In contrast both the S4GGnM-R and the Man-R are able to bind Man-BSA.
<mannose receptor> BINDS <man bsa>
<s4ggnm - r> BINDS <man bsa>
Purified recombinant NC1, like authentic NC1, also bound specifically to fibronectin, collagen type I, and a laminin 5/6 complex.
<authentic nc1> BINDS <laminin 5 / 6 complex>
<authentic nc1> BINDS <collagen type I>
<authentic nc1> BINDS <fibronectin>
<purified recombinant nc1> BINDS <laminin 5 / 6 complex>
<purified recombinant nc1> BINDS <collagen type I>
<purified recombinant nc1> BINDS <fibronectin>
The nonvisual arrestins, beta-arrestin and arrestin3, but not visual arrestin, bind specifically to a glutathione S-transferase-clathrin terminal domain fusion protein.
*<Arrestin3> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>
<beta arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>
<nonvisual arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>
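Once the conjuncts on each side of the verb have been identified (the hard part, and not attempted here), turning a coordinated sentence into binary assertions amounts to a Cartesian product of subjects and objects. The Python sketch below reproduces the six NC1 assertions under that assumption; `expand_coordination` is a hypothetical helper, and the conjunct lists are supplied by hand rather than parsed.

```python
from itertools import product


def expand_coordination(subjects, predicate, objects):
    """Return one binary assertion per (subject, object) pair."""
    return [(s, predicate, o) for s, o in product(subjects, objects)]


assertions = expand_coordination(
    ["purified recombinant nc1", "authentic nc1"],
    "BINDS",
    ["fibronectin", "collagen type I", "laminin 5 / 6 complex"],
)
for subject, predicate, obj in assertions:
    print(f"<{subject}> {predicate} <{obj}>")
```

Two subjects and three objects give six assertions, matching the count in the NC1 example above.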
![Page 28: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/28.jpg)
BioNLP Shared Task ’11
•Extension of BioNLP’09 tasks
–Generalization to full text (from abstracts)
–Additional event types: post-translational modifications and catalysis
•Methods:
–Based on empirically derived patterns
–Derived from training data + manual refinement
–Using dependency relations (syntax)
–Work of Haibin Liu (postdoc)
![Page 29: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/29.jpg)
Integrating background knowledge
•Can improve OpenDMAP precision with minimal cost to recall
–Take advantage of background knowledge
–Tighten constraints on slot fillers in the ontology
–No change to existing patterns
•Proof of concept:
–Distinguish among several types of protein activation (enzyme and receptor) in GeneRIFs
–Utilize Gene Ontology annotations
![Page 30: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/30.jpg)
Refining selectional restrictions
TP: [GeneRIF 104155] an ER stress induces the activation of [caspase-12_protein - catalytic activity]activated_entity via [caspase-3_protein]activator
Prevented FP: [GeneRIF 105594] factor Xa can induce mesangial cell proliferation through the activation of ERK_protein via PAR2_protein in mesangial cells
![Page 31: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/31.jpg)
Results
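| Event type | Measure | Original | Additional Memory | Difference |
|---|---|---|---|---|
| Enzyme events | Precision | 0.24 | 0.37 | 0.13 |
| Enzyme events | Recall | 0.27 | 0.20 | -0.07 |
| Enzyme events | F-measure | 0.26 | 0.26 | 0.00 |
| Receptor events | Precision | 0.08 | 0.34 | 0.26 |
| Receptor events | Recall | 0.17 | 0.12 | -0.05 |
| Receptor events | F-measure | 0.11 | 0.18 | 0.07 |
| Total | Precision | 0.16 | 0.36 | 0.20 |
| Total | Recall | 0.24 | 0.18 | -0.06 |
| Total | F-measure | 0.19 | 0.24 | 0.05 |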
![Page 32: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/32.jpg)
Biological entities
•Genes (and their products) are particularly valuable to recognize, but are not the only entities of interest:
–Diseases
–Drugs, Chemicals, and other treatments
–Anatomical and other locations
–Time and temporal relationships
–Methods and evidence
–Molecular functions, biological processes
![Page 33: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/33.jpg)
Biological Concept Recognition
![Page 34: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/34.jpg)
Two dictionary-based tools tested against CRAFT
•UIMA ConceptMapper
http://incubator.apache.org/uima/sandbox.html#concept.mapper.annotator
– stemming and case matching relaxation
– non-contiguous spans
– ignore stopwords
– order-independent lookup
•Open Biomedical Annotator
http://bioportal.bioontology.org/annotator
– ignore stopwords
– partial word matches
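To show how the matching strategies listed above change what a dictionary lookup finds, here is a toy Python matcher with switches for stemming, case sensitivity, stopword removal, and order-independent lookup. It does not use or imitate the ConceptMapper or Open Biomedical Annotator APIs: every name in it is hypothetical, the "stemmer" only strips a plural s, the stopword list is made up, and the identifier `EX:0001` is a placeholder rather than a real ontology ID. It looks up whole mentions rather than scanning running text.

```python
import re

STOPWORDS = {"of", "the", "a", "an", "in"}   # assumed list, for illustration


def stem(word):
    """Crude stand-in for a real stemmer: strip a trailing plural 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def normalize(text, use_stemming=True, case_sensitive=False, order_independent=True):
    """Reduce a term or mention to a comparable tuple of tokens."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    if use_stemming:
        tokens = [stem(t) for t in tokens]
    return tuple(sorted(tokens)) if order_independent else tuple(tokens)


def build_index(terms, **options):
    """Map each normalized label to its term identifier."""
    return {normalize(label, **options): term_id for term_id, label in terms.items()}


def lookup(mention, index, **options):
    """Options must match the ones used to build the index."""
    return index.get(normalize(mention, **options))


TERMS = {
    "GO:0005694": "chromosome",
    "EX:0001": "induction of apoptosis",   # placeholder ID, not a real one
}

relaxed = build_index(TERMS)
strict_options = dict(use_stemming=False, order_independent=False)
strict = build_index(TERMS, **strict_options)

print(lookup("Chromosomes", relaxed))                           # GO:0005694
print(lookup("apoptosis induction", relaxed))                   # EX:0001
print(lookup("Chromosomes", strict, **strict_options))          # None
print(lookup("apoptosis induction", strict, **strict_options))  # None
```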
![Page 35: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/35.jpg)
Best run results
• CM/CTO: stemming + FindAllMatches: false
• OBA/CTO: using default stop words
• CM/GO_CC: stemming + caseMatch: insensitive
• CM/ChEBI: caseMatch: sensitive
![Page 36: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/36.jpg)
Concept Matching Conclusions
•The kinds of terms in the ontology matter
•The strategies used in the dictionary matching tools matter
•OpenDMAP will support strategies that go beyond dictionary matching …
![Page 37: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/37.jpg)
Evaluation via Test Suite
• Big picture: How to evaluate ontology concept recognition systems?
• Traditional approach: “corpus”
– Expensive
– Time-consuming to produce
– Redundancy for some things…
– …underrepresentation of others
• Immediate (narrow) goal of this work: Use techniques from software testing and descriptive linguistics to build test suites that:
– Control test data
– Eliminate redundancy
– Systematic coverage (Oepen 1998)
• Immediate (broad) goal of this work: Are there general principles for test suite design?
Slide credit: Kevin B. Cohen
![Page 38: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/38.jpg)
Methods
•Steps: develop “catalogue” of dimensions along which terms vary
•Use insights from linguistics and from how we know concept recognition systems work
–Structural aspects: length
–Content aspects: typography, orthography, lexical contents (function words)…
•…to build a structured set of test cases
•Also compare to other test suite work (Cohen et al. 2004) to look for common principles
Slide credit: Kevin B. Cohen
![Page 39: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/39.jpg)
Structured test suite
Canonical
• GO:0000133 Polarisome
• GO:0000108 Repairosome
• GO:0000786 Nucleosome
• GO:0001660 Fever
• GO:0001726 Ruffle
• GO:0005623 Cell
• GO:0005694 Chromosome
• GO:0005814 Centriole
• GO:0005874 Microtubule
Non-canonical
• GO:0000133 Polarisomes
• GO:0000108 Repairosomes
• GO:0000786 Nucleosomes
• GO:0001660 Fevers
• GO:0001726 Ruffles
• GO:0005623 Cells
• GO:0005694 Chromosomes
• GO:0005814 Centrioles
• GO:0005874 Microtubules
induction of apoptosis -> apoptosis induction (Syntax)
cell migration -> cell migrated (Part of speech)
ensheathment of neurons -> ensheathment of some neurons
Slide credit: Kevin B. Cohen
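A structured test suite of this kind can be generated rather than collected. The Python sketch below derives non-canonical variants from canonical terms along two of the dimensions shown above (number and syntax) and scores a recognizer separately on each group. The transformation functions, the suite layout, and the exact-match recognizer are all hypothetical and only serve to exercise the harness; they are not the tools used in this work.

```python
import re


def pluralize(term):
    """Number variant: naive plural, adequate for the nouns listed above."""
    return term + "s"


def nominal_reorder(term):
    """Syntax variant: 'X of Y' -> 'Y X'; None if the term has another shape."""
    match = re.fullmatch(r"(\w+) of (\w+)", term)
    return f"{match.group(2)} {match.group(1)}" if match else None


CANONICAL = {
    "GO:0000133": "polarisome",
    "GO:0005694": "chromosome",
    "GO:0005874": "microtubule",
}


def build_suite(terms):
    """Pair each canonical term with a pluralized non-canonical variant."""
    suite = []
    for go_id, label in terms.items():
        suite.append((go_id, label, "canonical"))
        suite.append((go_id, pluralize(label), "non-canonical"))
    return suite


def recognition_rate(recognizer, suite, kind):
    """Fraction of test cases of the given kind mapped to the right ID."""
    cases = [(go_id, text) for go_id, text, k in suite if k == kind]
    hits = sum(1 for go_id, text in cases if recognizer(text) == go_id)
    return hits / len(cases)


# Hypothetical exact-match recognizer: returns a GO ID or None.
EXACT = {label: go_id for go_id, label in CANONICAL.items()}
exact_match = EXACT.get

suite = build_suite(CANONICAL)
print(nominal_reorder("induction of apoptosis"))              # apoptosis induction
print(recognition_rate(exact_match, suite, "canonical"))      # 1.0
print(recognition_rate(exact_match, suite, "non-canonical"))  # 0.0
```

Because every test case is labeled with the dimension it varies, a failure points directly at the kind of variation the recognizer cannot handle.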
![Page 40: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/40.jpg)
Methods/Results
•Gene Ontology, revision 9/24/2009
•Canonical: 188
•Non-canonical: 117
•Observation: –5:1 “dirty” versus 5:1 “clean” is mark of
“mature” testing
•Applied publicly available concept recognition systemSlide credit: Kevin B. Cohen
![Page 41: Research in the Verspoor Lab](https://reader035.vdocuments.us/reader035/viewer/2022070408/568143e2550346895db06a32/html5/thumbnails/41.jpg)
Results
•97.9% of canonical terms were recognized
–All exceptions contain the word “in”
•No non-canonical terms were recognized
•What would it take to recognize the error pattern with canonical terms using a corpus-based approach?
•General principles: Length, ortho/typography (numerals/punctuation), function/stopwords, syntactic context
Slide credit: Kevin B. Cohen