research in the verspoor lab

Karin Verspoor, Ph.D.

Faculty, Computational Bioscience Program

University of Colorado School of Medicine

compbio.ucdenver.edu/Hunter_lab/Verspoor

Research in the Verspoor Lab

Generally speaking…

•Focus on analysis of the biomedical literature

•For the purpose of:

–Turning unstructured data (natural language text) into structured statements

–Taking advantage of the wealth of information in the literature for biological data analysis

•Using (analyzing, building) semantic resources for the biomedical domain

Today: Focus on Ontologies

•Use of the structure of ontologies to understand relations among protein annotations

•Analysis of the term structure of ontologies

•Particular ontology of interest: Gene Ontology

Gene Ontology (GO)

•Taxonomic controlled vocabulary

•~ 16K nodes PGO populated by genes, proteins

•Two orders on PGO: ≤isa, ≤has

Gene Ontology Consortium (2000): “Gene Ontology: Tool for the Unification of Biology”, Nature Genetics, 25:25-29

The Gene Ontology: Usage

•33703 terms

–20403 biological_process

–2810 cellular_component

–8996 molecular_function

•Gene Annotations for 40+ organisms

•3504 publications in PubMed matching “gene ontology” (3/8/11)

•ISI Web of Knowledge: 5371 refs to GO paper

Graph statistics as of June 9, 2009

Protein Function Prediction

•Verspoor, K., Cohn, J., Mniszewski, S., and Joslyn, C. (2006). A Categorization Approach to Automated Ontological Function Annotation. Protein Science, v.15, pp.1544-1549.

http://www.proteinscience.org/cgi/content/abstract/15/6/1544


Automated Protein Function Annotation

•Mappings

–From regions of sequence, structure, keyword spaces

–Into regions of biological function space:

•taxonomic bio-ontologies of molecular function

•Characterize formal structure of bio-ontologies:

– Order theoretical approaches

– Combinatorial algorithms

POSOLE: POSet Ontology Laboratory Environment

• POSOLE: a general environment for ontology experimentation

– Graph representation of an ontology as a POSet

– POSet statistics analysis (e.g. depth, width, average rank)

– Algorithms for node categorization utilizing the structure of the ontology

• First Deployment: Ontology categorization for automated protein function annotation

– Function: Gene Ontology node

– Protein: target sequence or Swiss-Prot identifier

– Map proteins to sets of potential Gene Ontology nodes

– Ontology categorization: “clustering” nodes in ontology space to identify the most likely node assignment

• Dual Queries: Text and sequence neighborhoods
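As a rough illustration of the POSet view described above, the sketch below stores a toy ontology as child-to-parent edges and computes a few simple statistics. The node names, the `PARENTS` table, and the `poset_stats` helper are hypothetical and are not part of POSOLE; the largest level is reported only as a lower bound on the true width (largest antichain).

```python
from functools import lru_cache

# Toy is_a edges (child -> parents); a node may have several parents.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],
    "D": ["C"],
}


@lru_cache(maxsize=None)
def depth(node: str) -> int:
    """Length of the longest upward chain from node to a root."""
    parents = PARENTS[node]
    return 0 if not parents else 1 + max(depth(p) for p in parents)


def poset_stats() -> dict:
    depths = {n: depth(n) for n in PARENTS}
    levels = {}
    for node, d in depths.items():
        levels.setdefault(d, []).append(node)
    return {
        "height": max(depths.values()),
        "average_depth": sum(depths.values()) / len(depths),
        # Nodes sharing a longest-chain depth are pairwise incomparable,
        # so the largest level is a lower bound on the poset width.
        "max_level_size": max(len(v) for v in levels.values()),
    }
```

Calling `poset_stats()` on the toy graph reports the height, the mean depth, and the size of the largest level.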

POSOLE strategy

•Function Prediction as Categorization of Nearest Neighbors

•Application of POSOC categorization methodology utilizing the Gene Ontology structure to find the best covering nodes given a set of node “hits”

•“Hits” are based on (application-dependent) mappings from neighbors of an input protein to Gene Ontology nodes

•Covering nodes are function annotation predictions

• PosoleRun, core of each application

– Load the graph (GO)

– Build a query, a set of query items

– Categorize the query items

• Each application defines its own QueryBuilder
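The three PosoleRun steps can be pictured as a small pipeline with a pluggable query builder. This is only a sketch under assumptions: `load_graph`, `text_query_builder`, `categorize`, and `posole_run` are invented names, the graph is a three-node stand-in for the GO, and the categorizer is a placeholder that simply sums hit weights.

```python
from typing import Callable, Dict, List, Tuple

Graph = Dict[str, List[str]]      # node -> parents
Query = List[Tuple[str, float]]   # (node id, weight) query items


def load_graph() -> Graph:
    # Stand-in for loading the GO: a tiny hand-written hierarchy.
    return {"GO:root": [], "GO:a": ["GO:root"], "GO:b": ["GO:a"]}


def text_query_builder(raw: str) -> Query:
    # Hypothetical builder: each whitespace-separated id is a unit-weight hit.
    return [(tok, 1.0) for tok in raw.split()]


def categorize(graph: Graph, query: Query) -> List[Tuple[str, float]]:
    # Placeholder categorizer: total weight per known node, highest first.
    totals: Dict[str, float] = {}
    for node, weight in query:
        if node in graph:
            totals[node] = totals.get(node, 0.0) + weight
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


def posole_run(raw: str, build_query: Callable[[str], Query]):
    graph = load_graph()             # 1. load the graph
    query = build_query(raw)         # 2. build a query
    return categorize(graph, query)  # 3. categorize the query items
```

Swapping `text_query_builder` for another function with the same signature mirrors the idea that each application supplies its own QueryBuilder.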

POSOLE architecture

Categorization Task: POSOC

“Cluster” Genes in Ontology Space

• Given the Gene Ontology (GO) . . . And mappings to GO nodes . . .

• “Splatter” them over the GO . . . Where do they end up?

– Concentrated? -- Dispersed?

– Clustered? -- High or low?

– Overlapping or distinct?

• Pseudo-distances between comparable nodes to measure vertical separation

• POSOC traverses the structure of the GO, percolating hits upwards, and calculating scores for GO nodes.

• Scores to rank-order nodes with respect to gene locations, balancing:

– Coverage: Covering as many genes as possible

– Specificity: But at the “lowest level” possible

• “Cluster” based on non-comparable high score nodes

http://www.c3.lanl.gov/posoc/

Joslyn, Cliff; Mniszewski, Susan; Fulmer, Andy; and Heaton, Gary: (2004) “The Gene Ontology Categorizer”, Bioinformatics, v. 20:s1, pp. 169-177
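To make the coverage/specificity trade-off concrete, here is a toy scoring pass that percolates hits upward and multiplies a coverage fraction by a depth-based specificity. This is not the published POSOC score, which relies on pseudo-distances; the graph, the `score_nodes` function, and the specificity formula are assumptions chosen only to show the balancing idea.

```python
from collections import defaultdict
from fractions import Fraction

# Toy hierarchy (child -> parents).
PARENTS = {
    "root": [], "A": ["root"], "B": ["root"],
    "H": ["A"], "J": ["A"], "K": ["B"],
}


def ancestors_or_self(node):
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(PARENTS[n])
    return seen


def depth(node):
    parents = PARENTS[node]
    return 0 if not parents else 1 + max(depth(p) for p in parents)


def score_nodes(hits):
    """hits: list of node ids, one entry per mapped gene."""
    covered = defaultdict(int)
    for hit in hits:
        for node in ancestors_or_self(hit):  # percolate each hit upward
            covered[node] += 1
    max_depth = max(depth(n) for n in PARENTS)
    scores = {}
    for node, count in covered.items():
        coverage = Fraction(count, len(hits))                  # share of hits covered
        specificity = Fraction(1 + depth(node), 1 + max_depth)  # deeper = more specific
        scores[node] = coverage * specificity
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

With hits on `H`, `H`, `J`, and `K`, the root covers everything but is penalized for being general, while `A` and `H` tie at the top of the ranking.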

Order Theoretical Categorization Method

• Represent GO as labeled, finite ordered set

• Given labels (genes) c, e, i . . .

• What node(s) A,B, C, . . . ,K are best to attend to?

– C

– {H, J}

– {A, H, J}

POSOLE applications

Application: BioCreAtIvE I, Task 2

Critical Assessment of Information Extraction in Biology

• Automatic assignment of Gene Ontology annotations to human proteins based on a journal publication

– Given a Swiss-Prot/TrEMBL protein ID and a document, predict a GO node to which the protein should be annotated

– Also return the evidence text from the document supporting the annotation

• Strategy: Annotation as Categorization of Document Neighborhood

• Application of POSOC categorization utilizing the Gene Ontology structure to find the best covering nodes given a set of node “hits”

• “Hits” in this case are based on overlaps between input terms and GO node terms (in labels, definitions)

POSOC as applied to context terms

• Collect all terms in a context window of n sentences around any reference to the protein of interest

• Transform an input query into a set of node hits:

– Morphologically normalize GO node labels

– Look for any overlaps between input terms and terms in the normalized node labels

– An overlap = a node hit, with strength based on the input weight of the term (from TFIDF)

– Multiple overlaps on a given node count as multiple hits

• POSOC returns a set of GO nodes representing cluster heads for weighted term input set, and data on which input terms contributed to the selection of each cluster head: Annotation predictions
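The overlap step can be sketched as follows. The node ids `N1` and `N2`, the weights, and the crude plural-stripping `normalize` function are placeholders standing in for real GO nodes, TFIDF scores, and morphological normalization; only the shape of the computation (one weighted hit per overlap) follows the slide.

```python
import re

# Placeholder node labels; real input would be GO labels and definitions.
GO_LABELS = {
    "N1": "inflammatory response",
    "N2": "blood coagulation",
}


def normalize(word: str) -> str:
    # Crude stand-in for morphological normalization:
    # lowercase and strip a trailing plural "s".
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 3 else w


def node_hits(term_weights, labels):
    """term_weights: {context term: weight}. Returns [(node, term, weight)]."""
    hits = []
    for node, label in labels.items():
        label_terms = {normalize(t) for t in re.findall(r"[A-Za-z0-9-]+", label)}
        for term, weight in term_weights.items():
            if normalize(term) in label_terms:
                hits.append((node, term, weight))  # each overlap is one hit
    return hits
```

Two context terms overlapping the same label yield two hits on that node, matching the rule that multiple overlaps count as multiple hits.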

BioLASER: Los Alamos Semantic Event Recognizer for Biology

• Text analysis environment:

– Relation extraction

– Term vector analysis

• Domain-specific and application-specific components

• Markup workflow implementation

Application: CASP-6 Function Prediction

Critical Assessment of Structure Prediction evaluation

Function Prediction subtask

•Automatic assignment of Gene Ontology annotations to target protein sequences

•Strategy: Annotation as Categorization of Sequence Neighborhood

•Application of POSOC categorization utilizing the Gene Ontology structure to find the best covering nodes given a set of node “hits”

•“Hits” in this case are based on known mappings from proteins in the sequence neighborhood of the target to Gene Ontology nodes

CASP architecture

CASP Evaluation

• Test set

– proteins with known Gene Ontology mappings

– 4530 SwissProt protein sequences associated from PDB

– Protein to GO Mappings derived from UniProt

• Eliminate PSI-BLAST identity matches from mappings used in prediction

– Matches to protein with the same SwissProt Accession ID

– Matches to protein with an accession ID that maps to the same SwissProt Entry ID

– Matches to protein with an e-value < 10^-130 or e-value < max e-value for known identity match

• Goal: compare function predictions made by the system with known functions assigned to each input protein
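One possible reading of the identity-match filter is sketched below. The `Hit` record and `filter_identity_matches` function are hypothetical, and treating the two e-value conditions as a single threshold (the larger of the fixed cutoff and the largest e-value among known identity matches) is an interpretation of the slide rather than the documented procedure.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    accession: str   # accession ID of the matched protein
    entry_id: str    # entry ID the accession maps to
    evalue: float    # PSI-BLAST e-value


def filter_identity_matches(target_acc, target_entry, hits, cutoff=1e-130):
    # Matches already known to be the target itself.
    known = [h for h in hits
             if h.accession == target_acc or h.entry_id == target_entry]
    # Largest e-value seen among those known identity matches.
    max_known = max((h.evalue for h in known), default=0.0)
    threshold = max(cutoff, max_known)
    # Keep only hits that are neither known identities nor suspiciously strong.
    return [h for h in hits if h not in known and h.evalue >= threshold]
```

Only the surviving hits would then contribute GO mappings to the prediction step.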

CASP Evaluation runs

• Baseline Best Blast: Predictions are the GO nodes associated with the non-identical protein scoring highest in the PSI-BLAST analysis. All predicted GO nodes are considered to be at rank 1.

• Baseline Full Neighborhood: Predictions are the GO nodes associated with all proteins matched in the PSI-BLAST analysis (with e-value < 10). The predictions are ranked according to the e-value of the corresponding PSI-BLAST match.

• POSOC Best Blast: Inputs to POSOC are the GO nodes associated with the non-identical protein scoring highest in the PSI-BLAST analysis, weighted by the e-value of the match. POSOC categorizes and ranks these inputs to produce the predictions.

• POSOC Full Neighborhood: Inputs to POSOC are the GO nodes associated with all proteins matched in the PSI-BLAST analysis, weighted by the e-value of the match. POSOC categorizes and ranks these inputs to produce the predictions.
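The two baseline runs can be approximated in a few lines. The function names and input shapes are assumptions, and tie handling (a node reachable through several matches keeps its best rank) is a choice made for this sketch; the POSOC runs are not reproduced here.

```python
def baseline_best_blast(matches, annotations):
    """matches: [(protein, evalue)]; annotations: {protein: set of GO nodes}."""
    best, _ = min(matches, key=lambda m: m[1])   # lowest e-value = best match
    return {node: 1 for node in annotations.get(best, set())}  # all at rank 1


def baseline_full_neighborhood(matches, annotations, max_evalue=10.0):
    kept = sorted((m for m in matches if m[1] < max_evalue), key=lambda m: m[1])
    ranked = {}
    for rank, (protein, _) in enumerate(kept, start=1):
        for node in annotations.get(protein, set()):
            ranked.setdefault(node, rank)   # keep the best (lowest) rank
    return ranked
```

Both functions return a mapping from GO node to rank, so their outputs can be evaluated with the same rank-based measures.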

POSOC: Full Neighborhood

POSOC: Best Blast

Baseline: Full Neighborhood

Baseline: Best Blast

Evaluation analysis

•Precision/Recall

–Precision = % of predictions that are correct

–Recall = % of known annotations that are recovered

•Extension to ranked list of predictions

–Consider precision/recall at different ranks
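A minimal sketch of precision and recall at a rank cutoff, assuming a ranked list of unique predictions and a set of known annotations (the function name and the exact-match criterion are assumptions for this example):

```python
def precision_recall_at_rank(ranked_predictions, known, k):
    """Precision and recall over the top-k predictions (exact matches only)."""
    top = ranked_predictions[:k]
    correct = sum(1 for p in top if p in known)
    precision = correct / len(top) if top else 0.0
    recall = correct / len(known) if known else 0.0
    return precision, recall
```

Sweeping `k` from 1 to the length of the list gives the precision and recall curves as a function of rank.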

Ontological Distance Metrics

• How “far apart” are p and q?

• Genealogical approach:

• Radius 0: Equals: Direct match

• Radius 1: Nuclear family: Parents, children, siblings

• Radius 2: Extended family: grandparents, grandchildren, cousins, aunts/uncles, nieces/nephews
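One way to formalize the genealogical radius is to climb from each node to a common ancestor and take the longer of the two climbs, minimized over all common ancestors. This formalization, the toy tree, and the `family_radius` name are assumptions; they reproduce the family relations listed above (sibling = 1, cousin or aunt/niece = 2) but may differ in detail from the metric used in the evaluation.

```python
# Toy hierarchy (child -> parents).
PARENTS = {
    "root": [], "A": ["root"], "B": ["root"],
    "C": ["A"], "D": ["A"], "E": ["B"], "F": ["C"],
}


def up_distances(node):
    """Map each ancestor-or-self of node to its minimal number of upward steps."""
    dist, frontier = {node: 0}, [node]
    while frontier:
        nxt = []
        for n in frontier:
            for p in PARENTS[n]:
                if p not in dist:
                    dist[p] = dist[n] + 1
                    nxt.append(p)
        frontier = nxt
    return dist


def family_radius(p, q):
    up_p, up_q = up_distances(p), up_distances(q)
    common = set(up_p) & set(up_q)
    if not common:
        return None
    # Longer of the two climbs to a shared ancestor, minimized over ancestors.
    return min(max(up_p[a], up_q[a]) for a in common)
```

In the toy tree, `C` and `D` are siblings (radius 1), `C` and `E` are cousins (radius 2), and `F` and `D` are niece and aunt (radius 2).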

Evaluation results: Precision

Evaluation results: Recall

Evaluation of Ontological predictions

•Extension to ontological predictions: when does a GO node p in F(x) count as a “match” against a q in G(x)?

– What about siblings? Ancestors?

– Partial credit?

•Based on proximity

•Based on specificity

•Adapt hierarchical precision/recall measure from Kiritchenko et al 2005
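The base hierarchical measure can be sketched by extending both the predicted and the known node sets with all of their ancestors (excluding the root) and computing set overlap. This shows only the general idea of the Kiritchenko et al. measure as commonly described; the adaptation used in this work is not reproduced, and the tree and function names are assumptions.

```python
# Toy hierarchy (child -> parents).
PARENTS = {
    "root": [], "A": ["root"], "B": ["root"],
    "C": ["A"], "D": ["A"], "E": ["B"], "F": ["C"],
}


def with_ancestors(nodes, parents, root):
    """Nodes plus all their ancestors, leaving out the root."""
    out, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n != root and n not in out:
            out.add(n)
            stack.extend(parents[n])
    return out


def hierarchical_pr(predicted, known, parents=PARENTS, root="root"):
    p_hat = with_ancestors(predicted, parents, root)
    k_hat = with_ancestors(known, parents, root)
    overlap = len(p_hat & k_hat)
    hp = overlap / len(p_hat) if p_hat else 0.0   # hierarchical precision
    hr = overlap / len(k_hat) if k_hat else 0.0   # hierarchical recall
    return hp, hr
```

Predicting `D` when the known node is `F` earns partial credit through the shared ancestor `A`, which is the kind of proximity-based credit raised above.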

Hierarchical Precision vs. Rank

(Cellular Component branch)


(Molecular Function branch)


(Biological Process branch)

Summary: Protein Function Prediction

• We have constructed the POSOLE architecture, supporting integration of mappings from different spaces into function space

• We utilize the mathematical structure of function space as defined by the Gene Ontology to help identify commonalities and “clusters”, as well as in evaluation

• We have proposed an extension to Kiritchenko et al’s hierarchical precision/recall measure to support comparison of sets of predictions and answers

• The results on CASP function prediction show the promise of the POSOLE and POSOC technologies for automated annotation of protein sequences.

Ontology Quality Assurance

•Verspoor, K., Dvorkin, D., Cohen, K.B., Hunter, L. (2009) Ontology quality assurance through analysis of term transformations. Bioinformatics 25(12):i77-i84.

http://bioinformatics.oxfordjournals.org/cgi/content/full/25/12/i77?ijkey=qvYcpVnIJMjd19Y&keytype=ref


Regulation of transcription

Transcription Regulation

Positive regulation of cell migration

Cell migration positive regulation

Key quality concern: Univocality

•Univocality = one voice (Spinoza, 1677)

“a shared interpretation of the nature of reality” (with thanks to David Hill @ Jackson Lab)

•Consistency of expression of concepts

•Regular, compositional, linguistic structure

–Facilitates human usability

–Computational tools can utilize this regularity

Quality Assurance in the GO

•Goal: identify violations of univocality

•Problem: the GO is generally very high quality; how to identify the few inconsistencies?

•Hypothesis: violations of univocality will correspond to transformational variants

•Strategy: term transformation & clustering

GO Term Transformation: Abstraction

•Substitution of embedded GO & ChEBI terms

toluene oxidation via 3-hydroxytoluene → CTERM oxidation via CTERM

regulation of coagulation → regulation of GTERM

leukotriene production during acute inflammatory response → CTERM production during GTERM

GO Term Transformations

•Stopword removal

toluene oxidation via 3-hydroxytoluene → toluene oxidation 3-hydroxytoluene

regulation of coagulation → regulation coagulation

•Alphabetic reordering

toluene oxidation via 3-hydroxytoluene → 3-hydroxytoluene oxidation toluene via

regulation of coagulation → coagulation of regulation

Transformation combinations

•Abstraction=1, StopRemoval=1, Reordering=1

toluene oxidation via 3-hydroxytoluene

regulation of coagulation

leukotriene production during acute inflammatory response
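The three transformations and their combinations can be sketched as below. The embedded-term sets and the stopword list are tiny hand-picked stand-ins for the GO and ChEBI vocabularies, and `transform` with its `a`, `s`, `r` switches is a hypothetical helper; stemming, which appears in the cluster keys shown later, is left out.

```python
import re

# Assumed mini-vocabularies for this example only.
GO_TERMS = {"coagulation", "acute inflammatory response"}
CHEBI_TERMS = {"toluene", "3-hydroxytoluene", "leukotriene"}
STOPWORDS = {"of", "via", "during", "in", "the", "by"}


def abstract(term):
    # Replace embedded terms, longest first, so multi-word names win.
    for name in sorted(GO_TERMS | CHEBI_TERMS, key=len, reverse=True):
        tag = "GTERM" if name in GO_TERMS else "CTERM"
        # Match only whole tokens (not "toluene" inside "3-hydroxytoluene").
        pattern = r"(?<![\w-])" + re.escape(name) + r"(?![\w-])"
        term = re.sub(pattern, tag, term)
    return term


def remove_stopwords(term):
    return " ".join(w for w in term.split() if w not in STOPWORDS)


def reorder(term):
    return " ".join(sorted(term.split()))


def transform(term, a=True, s=True, r=True):
    if a:
        term = abstract(term)
    if s:
        term = remove_stopwords(term)
    if r:
        term = reorder(term)
    return term
```

With all three switches on, `transform("regulation of coagulation")` gives `GTERM regulation` in this sketch, since uppercase tags sort ahead of lowercase words.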

Transformation combinations

Clustering

•Group together all terms with a common form after transformation

•Perform clustering for different combinations of transformations

asr {GTERM constit structu}

GO:0005201 -- extracellular matrix structural constituent
GO:0005199 -- structural constituent of cell wall
GO:0005213 -- structural constituent of chorion
GO:0005200 -- structural constituent of cytoskeleton
GO:0003735 -- structural constituent of ribosome
GO:0017056 -- structural constituent of nuclear pore
GO:0019911 -- structural constituent of myelin sheath
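Clustering then amounts to grouping terms by their transformed form. In the sketch below, `cluster` is a generic group-by and `asr_key` is a simplified stand-in for the asr transformation: the embedded names are assumed for the example and no stemming is applied, so the key reads `GTERM constituent structural` rather than the stemmed key shown above.

```python
from collections import defaultdict


def cluster(terms, key):
    """terms: {GO id: label}; key: function mapping a label to its transformed form."""
    groups = defaultdict(list)
    for go_id, label in terms.items():
        groups[key(label)].append(go_id)
    return dict(groups)


# Simplified asr stand-in: assumed embedded names, one stopword, no stemming.
EMBEDDED = ["extracellular matrix", "cell wall", "ribosome"]
STOPWORDS = {"of"}


def asr_key(label):
    for name in EMBEDDED:
        label = label.replace(name, "GTERM")       # abstraction
    words = [w for w in label.split() if w not in STOPWORDS]  # stopword removal
    return " ".join(sorted(words))                 # alphabetic reordering


TERMS = {
    "GO:0005201": "extracellular matrix structural constituent",
    "GO:0005199": "structural constituent of cell wall",
    "GO:0003735": "structural constituent of ribosome",
}
```

`cluster(TERMS, asr_key)` places all three terms in one cluster even though the first uses the "X Y" pattern and the others use "Y of X".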

Analysis of clusters

•Heuristic search:

–Consider only clusters with abstraction (a±±)

–Identify terms that fall in distinct a-- clusters but merge together in a-r, as-, or asr.

•Manual assessment of 190 clusters
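The heuristic can be sketched as a comparison of two clusterings: flag every cluster under a stronger transformation whose members come from more than one a-- cluster. The function name and the dictionary-of-lists input shape are assumptions, and the ids `T1` and `T2` are placeholders.

```python
def merged_clusters(a_only, combined):
    """Both arguments map a cluster key to a list of term ids.

    Returns {combined key: sorted list of the a-- keys it merges} for every
    combined cluster that draws terms from more than one a-- cluster.
    """
    home = {t: key for key, members in a_only.items() for t in members}
    flagged = {}
    for key, members in combined.items():
        sources = {home[t] for t in members}
        if len(sources) > 1:
            flagged[key] = sorted(sources)
    return flagged
```

Flagged clusters are the candidates that would go forward to manual assessment.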

Transformation Impact

•25,539 source GO terms (12/2007 version)

•Pre-processing reduces to 23,478 (8% reduction)

•a=Abstraction, s=StopRemoval, r=Reordering

•Abstraction has most impact: 46% reduction

Abstraction breakdown, a-- clusters

Distribution of cluster size

(--- transformation; asr transformation)

True Positive clusters

•67 clusters

•317 GO terms

•Obsolete term filter: 7 clusters, 32 terms

•Approximately 77 term rephrasings anticipated

True Positive inconsistencies

•{X Y} ≈ {Y of X} | {Y in X} [45%]

{GTERM GTERM organis symbion}
GO:0052387 -- induction by organism of symbiont apoptosis
GO:0052351 -- induction by organism of systemic acquired resistance in symbiont
GO:0052350 -- induction by organism of induced systemic resistance in symbiont
GO:0052560 -- induction by organism of symbiont immune response
GO:0052399 -- induction by organism of symbiont programmed cell death
GO:0052396 -- induction by organism of symbiont non-apoptotic programmed cell death

{GTERM multice organis}
GO:0010259 -- multicellular organismal aging
GO:0022412 -- reproductive cellular process in multicellular organism
GO:0032504 -- multicellular organism reproduction
GO:0033057 -- reproductive behavior in a multicellular organism
GO:0033555 -- multicellular organismal response to stress
GO:0035264 -- multicellular organism growth

True Positives (2)

•Determiners [16%]

{GTERM forebra}
GO:0021861 -- radial glial cell differentiation in the forebrain
GO:0021846 -- cell proliferation in forebrain
GO:0021872 -- generation of neurons in the forebrain

{GTERM organ}
GO:0031100 -- organ regeneration
GO:0035265 -- organ growth
GO:0010260 -- organ senescence
GO:0001759 -- induction of an organ

True Positives (3)

•Other alternations [16%]

{GTERM selecti site}
GO:0000282 -- cellular bud site selection
GO:0000918 -- selection of site for barrier septum formation

•Conflicting conventions [6%]

{GTERM endothe} (partial listing)
GO:0003100 -- regulation of systemic arterial blood pressure by endothelin
GO:0004962 -- endothelin receptor activity

•Punctuation [3%]

GO:0016653 -- oxidoreductase activity, acting on NADH, heme protein as acceptor
GO:0016658 -- oxidoreductase activity, acting on NADH, flavin as acceptor
GO:0050664 -- oxidoreductase activity, acting on NADH, with oxygen as acceptor

GO:0043247 -- telomere maintenance in response to DNA damage
GO:0042770 -- DNA damage response, signal transduction

True Positives (4)

•“Grab bag”

–Lexical choice

•“within” vs. “in”

•“substrate-specific” vs. “substrate-dependent”

–Superfluous words like “other”

False positive breakdown

False positive cluster examples

•Semantic import of stopword [50%]

{CTERM GTERM levels modulat symbion} (partial listing)
GO:0052430 -- modulation by host of symbiont RNA levels
GO:0052018 -- modulation by symbiont of host RNA levels

{CTERM CTERM galacto GTERM}
GO:0033580 -- protein amino acid galactosylation at cell surface
GO:0033582 -- protein amino acid galactosylation in cytosol
GO:0033579 -- protein amino acid galactosylation in endoplasmic reticulum

{callose deposit GTERM}
GO:0052542 -- callose deposition during defense response
GO:0052543 -- callose deposition in cell wall

False positives (2)

•Non-parallel structure [27%]

{CTERM CTERM}
GO:0005204 -- chondroitin sulfate proteoglycan
GO:0006088 -- acetate to acetyl-CoA
GO:0015641 -- lipoprotein toxin

{GTERM GTERM GTERM} (partial listing)
GO:0019896 -- axon transport of mitochondrion
GO:0047496 -- vesicle transport along microtubule
GO:0047497 -- mitochondrion transport along microtubule
GO:0032066 -- nucleolus to nucleoplasm transport
GO:0052067 -- negative regulation by symbiont of entry into host cell via phagocytosis

{GTERM storage}
GO:0001506 -- neurotransmitter biosynthetic process and storage
GO:0000322 -- storage vacuole

False positives (3)

•Stemming [17%]

{regulat GTERM} (partial listing)
GO:0045066 -- regulatory T cell differentiation
GO:0045069 -- regulation of viral genome replication
GO:0045055 -- regulated secretory pathway
GO:0031347 -- regulation of defense response

•Syntactic variation [5%]

{GTERM mainten}
GO:0045216 -- intercellular junction assembly and maintenance
GO:0045217 -- intercellular junction maintenance
GO:0045218 -- zonula adherens maintenance

•Semantic import of word order [5%]

{GTERM CTERM activit} {CTERM GTERM activit}

apoptosis inhibitor activity

gibberellin binding activity

Conclusions

•Used simple term transformations and heuristic search

•Able to reduce set of clusters to be manually evaluated to 190 (for 25k terms)

•Identified 67 TP instances of univocality violations covering 317 GO terms

•Future work

–More specific linguistic alternations

–Improve heuristics for TP search

GO as a lexical semantic resource

• The Gene Ontology represents semantic relationships (is_a, part_of) between biological phrases representing molecular functions/processes

• Utilize the structure of the GO and lexical correspondences to infer relationships at the term level from relationships between phrases

Verspoor, C., C. Joslyn and G. Papcun (2003). "The Gene Ontology as a Source of Lexical Semantic Knowledge for a Biological Natural Language Processing Application". In Proceedings of the SIGIR'03 Workshop on Text Analysis and Search for Bioinformatics.

Inferring Lexical Relations from GO

Parallel rule:

vanillin metabolism isa aldehyde metabolism ⇒ vanillin isa aldehyde

lipoprotein biosynthesis isa lipoprotein metabolism ⇒ biosynthesis isa metabolism

Modifier rule: blocking rule for modifiers

Positive gravitactic behavior isa gravitactic behavior ⇒ Ø

Larval feeding behavior (sensu insecta) isa Larval feeding behavior ⇒ Ø

Insertion rule: right-branching heuristic

adult feeding behavior isa adult behavior ⇒ feeding behavior isa behavior

chemosensory jump behavior isa chemosensory behavior ⇒ jump behavior isa behavior

Verspoor et al. (2003)
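The three rules can be read as a small token-level procedure, sketched below. The conditions are inferred from the slide examples only (whitespace tokens, exact string comparison), so `infer_relation` is a simplified illustration and may not match the rule definitions in the 2003 paper.

```python
def infer_relation(child, parent):
    """Given 'child isa parent' between phrases, return an inferred
    (sub, super) pair between terms, or None when no inference is made."""
    c, p = child.split(), parent.split()
    # Modifier rule: child is just the parent with words added in front
    # or behind, so the inference is blocked.
    if len(c) > len(p) and (c[-len(p):] == p or c[:len(p)] == p):
        return None
    # Parallel rule: same length, exactly one differing position.
    if len(c) == len(p):
        diffs = [(x, y) for x, y in zip(c, p) if x != y]
        return diffs[0] if len(diffs) == 1 else None
    # Insertion rule: strip the shared prefix and relate what remains,
    # grouping the inserted word with the material to its right.
    if len(c) > len(p):
        i = 0
        while i < len(p) and c[i] == p[i]:
            i += 1
        rest_c, rest_p = c[i:], p[i:]
        if rest_p and rest_c[-len(rest_p):] == rest_p:
            return (" ".join(rest_c), " ".join(rest_p))
    return None
```

Counting how often each inferred pair recurs across all isa links in the ontology would give a frequency table of the kind shown on the next slide.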

Relations inferred (with counts)

581 biosynthesis isa metabolism

577 catabolism isa metabolism

44 receptor isa binding

38 deoxyribonucleoside isa nucleoside

35 ribonucleoside isa nucleoside

33 permease isa transporter

27 Saccharomyces isa Fungi

22 porter isa transporter

15 oxidation isa metabolism

14 tRNA isa RNA

14 inhibitor isa regulator

13 ribonucleotide isa nucleotide

11 proliferation isa activation

11 differentiation isa activation

11 deoxyribonucleotide isa nucleotide

10 rRNA isa RNA

10 mRNA isa RNA

9 snRNA isa RNA

8 modification isa metabolism

8 methylation isa modification

6,364 unique relations inferred; only 70 already exist in the GO

3,270/6,589 unique labels inferred that do not occur in the GO as terms


A portion of the induced network

773 trees in the induced hierarchy

• 669 depth 2, 69 depth 3

• max depth 10, “biosynthesis”


Other uses of ontological structure

•Use of subsumption hierarchy to fill in examples for machine learning

•Establish semantic (functional) similarity–Graph distance

–Information content

•Inference, reasoning

•Constraints for information extraction (next time …)
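As one example of an information-content similarity, the sketch below uses a Resnik-style measure: the information content of a node is the negative log of its cumulative annotation frequency, and two nodes are as similar as their most informative common ancestor. The slide does not name a specific measure, so this choice, the toy graph, and the made-up annotation counts are all assumptions.

```python
import math

# Toy hierarchy (child -> parents) and made-up direct annotation counts.
PARENTS = {"root": [], "A": ["root"], "B": ["root"], "C": ["A"], "D": ["A"]}
DIRECT = {"B": 4, "C": 2, "D": 2}


def ancestors_or_self(node):
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(PARENTS[n])
    return seen


def information_content():
    total = sum(DIRECT.values())
    cumulative = {n: 0 for n in PARENTS}
    for node, count in DIRECT.items():
        for a in ancestors_or_self(node):   # annotations count for all ancestors
            cumulative[a] += count
    return {n: -math.log2(c / total) for n, c in cumulative.items() if c}


def resnik(a, b):
    ic = information_content()
    common = ancestors_or_self(a) & ancestors_or_self(b)
    return max(ic[n] for n in common)   # most informative common ancestor
```

Siblings `C` and `D` share the informative ancestor `A`, while `C` and `B` share only the root, which carries no information.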
