Karin Verspoor, Ph.D.
Faculty, Computational Bioscience Program
University of Colorado School of Medicine
compbio.ucdenver.edu/Hunter_lab/Verspoor
Research in the Verspoor Lab
Generally speaking…
•Focus on analysis of the biomedical literature
•For the purpose of:
–Turning unstructured data (natural language text) into structured statements
–Taking advantage of the wealth of information in the literature for biological data analysis
•Using (analyzing, building) semantic resources for the biomedical domain
Today: Focus on Ontologies
•Use of the structure of ontologies to understand relations among protein annotations
•Analysis of the term structure of ontologies
•Particular ontology of interest: Gene Ontology
Gene Ontology (GO)
•Taxonomic controlled vocabulary
•~16K nodes P_GO populated by genes, proteins
•Two orders on P_GO: ≤_isa, ≤_has
Gene Ontology Consortium (2000): “Gene Ontology: Tool for the Unification of Biology”, Nature Genetics, 25:25-29
The Gene Ontology: Usage
•33703 terms
–20403 biological_process
–2810 cellular_component
–8996 molecular_function
•Gene Annotations for 40+ organisms
•3504 publications in PubMed matching “gene ontology” (3/8/11)
• ISI Web of Knowledge: 5371 refs to GO paper
Graph statistics as of June 9, 2009
Protein Function Prediction
•Verspoor, K., Cohn, J., Mniszewski, S., and Joslyn, C. (2006). A Categorization Approach to Automated Ontological Function Annotation. Protein Science, v.15, pp.1544-1549.
Automated Protein Function Annotation
•Mappings
–From regions of sequence, structure, keyword spaces
–Into regions of biological function space:
  •taxonomic bio-ontologies of molecular function
•Characterize formal structure of bio-ontologies:
–Order theoretical approaches
–Combinatorial algorithms
POSOLE: POSet Ontology Laboratory Environment
• POSOLE: a general environment for ontology experimentation
– Graph representation of an ontology as a POSet
– POSet statistics analysis (e.g. depth, width, average rank)
– Algorithms for node categorization utilizing the structure of the ontology
• First Deployment: Ontology categorization for automated protein function annotation
– Function: Gene Ontology node
– Protein: target sequence or Swiss-Prot identifier
– Map proteins to sets of potential Gene Ontology nodes
– Ontology categorization: “clustering” nodes in ontology space to identify the most likely node assignment
• Dual Queries: Text and sequence neighborhoods
POSOLE strategy
•Function Prediction as Categorization of Nearest Neighbors
•Application of POSOC categorization methodology utilizing the Gene Ontology structure to find the best covering nodes given a set of node “hits”
•“Hits” are based on (application-dependent) mappings from neighbors of an input protein to Gene Ontology nodes
•Covering nodes are function annotation predictions
• PosoleRun, core of each application
– Load the graph (GO)
– Build a query, a set of query items
– Categorize the query items
• Each application defines its own QueryBuilder
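The three PosoleRun steps can be pictured with a minimal Python sketch. Only the names PosoleRun and QueryBuilder come from the slide; `LookupQueryBuilder`, `rank_by_weight`, the toy node ids, and all method signatures are hypothetical, and the categorizer is a trivial placeholder rather than POSOC.

```python
from abc import ABC, abstractmethod
from collections import Counter


class QueryBuilder(ABC):
    """Application-specific step: turn an input item into weighted node hits."""

    @abstractmethod
    def build(self, item: str) -> Counter:
        ...


class LookupQueryBuilder(QueryBuilder):
    """Hypothetical builder: collects the GO nodes of an item's neighbors."""

    def __init__(self, neighbors, annotations):
        self.neighbors = neighbors        # protein -> list of neighbor proteins
        self.annotations = annotations    # protein -> list of GO node ids

    def build(self, item):
        hits = Counter()
        for neighbor in self.neighbors.get(item, []):
            hits.update(self.annotations.get(neighbor, []))
        return hits


def rank_by_weight(graph, query):
    """Placeholder categorizer (not POSOC): order nodes by hit count."""
    return [node for node, _ in query.most_common()]


def posole_run(graph, builder, item, categorize):
    """Graph is assumed already loaded; build the query, then categorize it."""
    query = builder.build(item)
    return categorize(graph, query)


# Toy data with made-up identifiers.
neighbors = {"P1": ["P2", "P3"]}
annotations = {"P2": ["GO:a", "GO:b"], "P3": ["GO:a"]}
builder = LookupQueryBuilder(neighbors, annotations)
ranked = posole_run({}, builder, "P1", rank_by_weight)
```

Keeping the builder behind an abstract interface is what lets the text-based and sequence-based applications share the same run loop.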
POSOLE architecture
Categorization Task: POSOC
“Cluster” Genes in Ontology Space
• Given the Gene Ontology (GO) . . . And mappings to GO nodes . . .
• “Splatter” them over the GO . . . Where do they end up?
– Concentrated? Dispersed?
– Clustered? High or low?
– Overlapping or distinct?
• Pseudo-distances between comparable nodes to measure vertical separation
• POSOC traverses the structure of the GO, percolating hits upwards, and calculating scores for GO nodes.
• Scores to rank-order nodes with respect to gene locations, balancing:
– Coverage: Covering as many genes as possible
– Specificity: But at the “lowest level” possible
• “Cluster” based on non-comparable high score nodes
http://www.c3.lanl.gov/posoc/
Joslyn, Cliff; Mniszewski, Susan; Fulmer, Andy; and Heaton, Gary: (2004) “The Gene Ontology Categorizer”, Bioinformatics, v. 20:s1, pp. 169-177
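The idea of percolating hits upward and trading coverage against specificity can be illustrated with a small sketch. The scoring formula below (fraction of hit weight covered, divided by one plus the mean number of upward steps) is an assumption chosen for illustration; it is not the published POSOC score, which is based on pseudo-distances. The graph and node names are toy data.

```python
from collections import defaultdict, deque


def upward_distances(parents, node):
    """Minimum number of is-a steps from node to itself and each ancestor."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in dist:
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist


def score_nodes(parents, hits):
    """Percolate weighted hits upward and rank every node that covers a hit."""
    total = sum(hits.values())
    covered = defaultdict(float)   # hit weight at or below each node
    lifted = defaultdict(float)    # weight times upward distance travelled
    for node, weight in hits.items():
        for ancestor, d in upward_distances(parents, node).items():
            covered[ancestor] += weight
            lifted[ancestor] += weight * d
    scores = {}
    for node in covered:
        coverage = covered[node] / total
        mean_lift = lifted[node] / covered[node]
        scores[node] = coverage / (1.0 + mean_lift)  # assumed trade-off
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy hierarchy: root has children A and B; A has children C, D, E.
parents = {"A": ["root"], "B": ["root"], "C": ["A"], "D": ["A"], "E": ["A"]}
ranked = score_nodes(parents, {"C": 1, "D": 1, "E": 1})
```

With three hits spread over the children of A, node A covers everything at a distance of one step and outranks both the individual leaves (low coverage) and the root (low specificity).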
Order Theoretical Categorization Method
• Represent GO as labeled, finite ordered set
• Given labels (genes) c, e, i . . .
• What node(s) A, B, C, . . . , K are best to attend to?
– C
– {H, J}
– {A, H, J}
POSOLE applications
Application: BioCreAtIvE I, Task 2
Critical Assessment of Information Extraction in Biology
• Automatic assignment of Gene Ontology annotations to human proteins based on a journal publication
– Given a Swiss-Prot/TrEMBL protein ID and a document, predict a GO node to which the protein should be annotated
– Also return the evidence text from the document supporting the annotation
• Strategy: Annotation as Categorization of Document Neighborhood
• Application of POSOC categorization utilizing the Gene Ontology structure to find the best covering nodes given a set of node “hits”
• “Hits” in this case are based on overlaps between input terms and GO node terms (in labels, definitions)
POSOC as applied to context terms
• Collect all terms in a context window of n sentences around any reference to the protein of interest
• Transform an input query into a set of node hits:
– Morphologically normalize GO node labels
– Look for any overlaps between input terms and terms in the normalized node labels
– An overlap = a node hit, with strength based on the input weight of the term (from TFIDF)
– Multiple overlaps on a given node count as multiple hits
• POSOC returns a set of GO nodes representing cluster heads for the weighted term input set, together with data on which input terms contributed to the selection of each cluster head. These cluster heads are the annotation predictions.
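The conversion from context terms to node hits can be sketched as follows. The `normalize` function is a deliberately crude stand-in for morphological normalization, and the term weights are made-up numbers standing in for TFIDF values; the GO labels are ones that appear elsewhere in these slides.

```python
from collections import defaultdict


def normalize(word):
    """Crude stand-in for morphological normalization (assumption)."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def node_hits(term_weights, go_labels):
    """Return node -> list of (term, weight); each overlap counts as one hit."""
    hits = defaultdict(list)
    for node, label in go_labels.items():
        label_terms = {normalize(w) for w in label.split()}
        for term, weight in term_weights.items():
            if normalize(term) in label_terms:
                hits[node].append((term, weight))
    return dict(hits)


go_labels = {
    "GO:0031100": "organ regeneration",
    "GO:0035265": "organ growth",
    "GO:0045217": "intercellular junction maintenance",
}
# Hypothetical weights for terms found near the protein mention.
term_weights = {"organs": 0.8, "growth": 0.5, "kinase": 0.9}
hits = node_hits(term_weights, go_labels)
```

A node matched by two terms receives two hits, which mirrors the rule that multiple overlaps on a node count as multiple hits.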
BioLASER: Los Alamos Semantic Event Recognizer for Biology
• Text analysis environment:
– Relation extraction
– Term vector analysis
• Domain-specific and application-specific components
• Markup workflow implementation
Application: CASP-6 Function Prediction
Critical Assessment of Structure Prediction evaluation
Function Prediction subtask
•Automatic assignment of Gene Ontology annotations to target protein sequences
•Strategy: Annotation as Categorization of Sequence Neighborhood
•Application of POSOC categorization utilizing the Gene Ontology structure to find the best covering nodes given a set of node “hits”
•“Hits” in this case are based on known mappings from proteins in the sequence neighborhood of the target to Gene Ontology nodes
CASP architecture
CASP Evaluation
• Test set
– proteins with known Gene Ontology mappings
– 4530 SwissProt protein sequences associated from PDB
– Protein to GO Mappings derived from UniProt
• Eliminate PSI-BLAST identity matches from mappings used in prediction
– Matches to protein with the same SwissProt Accession ID
– Matches to protein with an accession ID that maps to the same SwissProt Entry ID
– Matches to protein with an e-value < 10^-130 or e-value < max e-value for known identity match
• Goal: compare function predictions made by the system with known functions assigned to each input protein
CASP Evaluation runs
• Baseline Best Blast: Predictions are the GO nodes associated with non-identical protein scoring highest in the PSI-BLAST analysis. All predicted GO nodes are considered to be at rank 1.
• Baseline Full Neighborhood: Predictions are the GO nodes associated with all proteins matched in the PSI-BLAST analysis (with evalue < 10). The predictions are ranked according to the evalue of the corresponding PSI-BLAST match.
• POSOC Best Blast: Inputs to POSOC are the GO nodes associated with non-identical protein scoring highest in the PSI-BLAST analysis, weighted by evalue of the match. POSOC categorizes and ranks these inputs to produce the predictions.
• POSOC Full Neighborhood: Inputs to POSOC are the GO nodes associated with all proteins matched in the PSI-BLAST analysis, weighted by evalue of the match. POSOC categorizes and ranks these inputs to produce the predictions.
POSOC: Full Neighborhood
POSOC: Best Blast
Baseline: Full Neighborhood
Baseline: Best Blast
Evaluation analysis
•Precision/Recall
–Precision = % of predictions that are correct
–Recall = % of known annotations that are recovered
•Extension to ranked list of predictions
–Consider precision/recall at different ranks
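Precision and recall at a rank cutoff can be computed as in the sketch below. It assumes a strictly ordered list of unique predictions and exact-match scoring; tied ranks (as in the Best Blast baseline, where every prediction is at rank 1) would need extra handling. The function name and toy data are illustrative.

```python
def precision_recall_at_k(ranked_predictions, known, k):
    """Exact-match precision and recall over the top k predictions."""
    top = ranked_predictions[:k]
    correct = sum(1 for p in top if p in known)
    precision = correct / len(top) if top else 0.0
    recall = correct / len(known) if known else 0.0
    return precision, recall


ranked = ["a", "b", "c", "d"]      # toy predictions, best first
known = {"a", "c", "e"}            # toy gold-standard annotations
at_2 = precision_recall_at_k(ranked, known, 2)
at_4 = precision_recall_at_k(ranked, known, 4)
```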
Ontological Distance Metrics
• How “far apart” are p and q?
• Genealogical approach:
• Radius 0: Equals: Direct match
• Radius 1: Nuclear family: Parents, children, siblings
• Radius 2: Extended family: grandparents, grandchildren, cousins, aunts/uncles, nieces/nephews
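One way to make these radii computable is to take, over all common ancestors c of p and q (counting a node as its own ancestor), the smallest value of max(steps from p up to c, steps from q up to c). This formalization is an assumption that happens to reproduce the family relations listed above; the slides do not state the definition. The family graph is toy data.

```python
from collections import deque


def upward_distances(parents, node):
    """Minimum number of upward steps from node to itself and each ancestor."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in dist:
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist


def radius(parents, p, q):
    """Genealogical radius between p and q, or None if they are unrelated."""
    up_p = upward_distances(parents, p)
    up_q = upward_distances(parents, q)
    common = up_p.keys() & up_q.keys()
    if not common:
        return None
    return min(max(up_p[c], up_q[c]) for c in common)


# child -> parents
family = {
    "parent": ["grand"],
    "aunt": ["grand"],
    "p": ["parent"],
    "sib": ["parent"],
    "cousin": ["aunt"],
}
radii = {name: radius(family, "p", name)
         for name in ["p", "parent", "sib", "grand", "aunt", "cousin"]}
```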
Evaluation results: Precision
Evaluation results: Recall
Evaluation of Ontological predictions
•Extension to ontological predictions: when does a GO node p in F(x) count as a “match” against a q in G(x)?
– What about siblings? Ancestors?
– Partial credit?
  •Based on proximity
  •Based on specificity
•Adapt hierarchical precision/recall measure from Kiritchenko et al 2005
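As a reference point, the basic hierarchical precision/recall idea extends both the predicted and the known node sets with all of their ancestors (excluding the root) and then measures set overlap. The sketch below shows that basic measure only, as it is commonly stated; it is not the adaptation used in this work, and the function names and toy hierarchy are illustrative.

```python
def with_ancestors(parents, nodes, root):
    """Close a node set under the ancestor relation, leaving out the root."""
    closed = set()
    stack = list(nodes)
    while stack:
        node = stack.pop()
        if node == root or node in closed:
            continue
        closed.add(node)
        stack.extend(parents.get(node, ()))
    return closed


def hierarchical_pr(parents, predicted, known, root):
    """Hierarchical precision and recall from ancestor-extended sets."""
    p_hat = with_ancestors(parents, predicted, root)
    k_hat = with_ancestors(parents, known, root)
    overlap = len(p_hat & k_hat)
    hp = overlap / len(p_hat) if p_hat else 0.0
    hr = overlap / len(k_hat) if k_hat else 0.0
    return hp, hr


# Toy hierarchy: root -> A, B; A -> C, D.
parents = {"A": ["root"], "B": ["root"], "C": ["A"], "D": ["A"]}
sibling_case = hierarchical_pr(parents, {"C"}, {"D"}, "root")
ancestor_case = hierarchical_pr(parents, {"A"}, {"C"}, "root")
```

Predicting a sibling earns partial credit on both measures, while predicting a correct but less specific ancestor keeps precision at 1.0 and loses recall.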
Hierarchical Precision vs. Rank
(Cellular Component branch)
Hierarchical Precision vs. Rank
(Molecular Function branch)
Hierarchical Precision vs. Rank
(Biological Process branch)
Summary: Protein Function Prediction
• We have constructed the POSOLE architecture, supporting integration of mappings from different spaces into function space
• We utilize the mathematical structure of function space as defined by the Gene Ontology to help identify commonalities and “clusters”, as well as in evaluation
• We have proposed an extension to Kiritchenko et al.’s hierarchical precision/recall measure to support comparison of sets of predictions and answers
• The results on CASP function prediction show the promise of the POSOLE and POSOC technologies for automated annotation of protein sequences.
Ontology Quality Assurance
•Verspoor, K., Dvorkin, D., Cohen, K.B., Hunter, L. (2009) Ontology quality assurance through analysis of term transformations. Bioinformatics 25(12):i77-i84.
Regulation of transcription
Transcription Regulation
Positive regulation of cell migration
Cell migration positive regulation
Key quality concern: Univocality
•Univocality = one voice (Spinoza, 1677)
“a shared interpretation of the nature of reality”
(with thanks to David Hill @ Jackson Lab)
•Consistency of expression of concepts
•Regular, compositional, linguistic structure
–Facilitates human usability
–Computational tools can utilize this regularity
Quality Assurance in the GO
•Goal: identify violations of univocality
•Problem: the GO is generally very high quality; how to identify the few inconsistencies?
•Hypothesis: violations of univocality will correspond to transformational variants
•Strategy: term transformation & clustering
GO Term Transformation: Abstraction
•Substitution of embedded GO & ChEBI terms
toluene oxidation via 3-hydroxytoluene → CTERM oxidation via CTERM
regulation of coagulation → regulation of GTERM
leukotriene production during acute inflammatory response → CTERM production during GTERM
GO Term Transformations
•Stopword removal
toluene oxidation via 3-hydroxytoluene → toluene oxidation 3-hydroxytoluene
regulation of coagulation → regulation coagulation
•Alphabetic reordering
toluene oxidation via 3-hydroxytoluene → 3-hydroxytoluene oxidation toluene via
regulation of coagulation → coagulation of regulation
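The three transformations can be sketched in a few lines of Python. The stopword list and the small GO and ChEBI name sets are illustrative assumptions, not the resources used in the study, and the whole-token regular expression is just one possible way to find embedded names.

```python
import re

STOPWORDS = {"of", "via", "in", "the"}  # illustrative list only


def abstract(term, go_terms, chebi_terms):
    """Replace embedded GO/ChEBI names with GTERM/CTERM, longest name first."""
    original = term
    names = [(n, "GTERM") for n in go_terms] + [(n, "CTERM") for n in chebi_terms]
    for name, tag in sorted(names, key=lambda pair: -len(pair[0])):
        if name == original:
            continue  # only embedded names are abstracted, never the whole term
        # Match the name only when it is not part of a longer hyphenated token.
        pattern = r"(?<![\w-])" + re.escape(name) + r"(?![\w-])"
        term = re.sub(pattern, tag, term)
    return term


def remove_stopwords(term):
    return " ".join(w for w in term.split() if w not in STOPWORDS)


def reorder(term):
    """Alphabetic reordering of the tokens."""
    return " ".join(sorted(term.split()))


go_terms = {"coagulation", "acute inflammatory response"}   # toy subset
chebi_terms = {"toluene", "3-hydroxytoluene", "leukotriene"}  # toy subset

abstracted = abstract("toluene oxidation via 3-hydroxytoluene", go_terms, chebi_terms)
stopped = remove_stopwords("regulation of coagulation")
reordered = reorder("toluene oxidation via 3-hydroxytoluene")
```

Replacing longer names first keeps "3-hydroxytoluene" from being partially rewritten by the shorter name "toluene".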
Transformation combinations
•Abstraction=1, StopRemoval=1, Reordering=1
toluene oxidation via 3-hydroxytoluene
regulation of coagulation
leukotriene production during acute inflammatory response
Transformation combinations
Clustering
•Group together all terms with a common form after transformation
•Perform clustering for different combinations of transformations
asr {GTERM constit structu}
GO:0005201 -- extracellular matrix structural constituent
GO:0005199 -- structural constituent of cell wall
GO:0005213 -- structural constituent of chorion
GO:0005200 -- structural constituent of cytoskeleton
GO:0003735 -- structural constituent of ribosome
GO:0017056 -- structural constituent of nuclear pore
GO:0019911 -- structural constituent of myelin sheath
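Grouping terms by their transformed form amounts to building a dictionary keyed on that form. The sketch below uses only stopword removal and alphabetic reordering as the key (no abstraction or stemming), with an illustrative stopword list; the example terms are the variant pairs shown earlier in this section.

```python
from collections import defaultdict

STOPWORDS = {"of", "in", "the", "via"}  # illustrative list only


def canonical_form(term):
    """Lowercase, drop stopwords, and sort the remaining tokens."""
    tokens = [w for w in term.lower().split() if w not in STOPWORDS]
    return " ".join(sorted(tokens))


def cluster(terms):
    """Group together all terms that share a canonical form."""
    clusters = defaultdict(list)
    for term in terms:
        clusters[canonical_form(term)].append(term)
    return dict(clusters)


terms = [
    "Regulation of transcription",
    "Transcription Regulation",
    "Positive regulation of cell migration",
    "Cell migration positive regulation",
    "organ growth",
]
clusters = cluster(terms)
```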
Analysis of clusters
•Heuristic search:
–Consider only clusters with abstraction (a±±)
–Identify terms that sit in distinct a-- clusters but merge together in a-r, as-, or asr
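The merge test in this heuristic can be expressed generically: cluster the terms under a coarse key, then keep only the coarse clusters whose members carried more than one distinct fine key. In the sketch below the fine key is a stand-in for the a-- form and the coarse key a stand-in for the asr form; both key functions and the stopword list are simplifying assumptions.

```python
from collections import defaultdict

STOPWORDS = {"of", "in", "the", "an"}  # illustrative list only


def fine_key(term):
    """Stand-in for the a-- form: the term as written, lowercased."""
    return term.lower()


def coarse_key(term):
    """Stand-in for the asr form: stopwords removed, tokens sorted."""
    return " ".join(sorted(w for w in term.lower().split() if w not in STOPWORDS))


def merged_clusters(terms, fine, coarse):
    """Coarse clusters whose members come from more than one fine cluster."""
    grouped = defaultdict(list)
    for term in terms:
        grouped[coarse(term)].append(term)
    return {
        key: members
        for key, members in grouped.items()
        if len({fine(t) for t in members}) > 1
    }


terms = ["organ growth", "growth of organ", "organ senescence"]
candidates = merged_clusters(terms, fine_key, coarse_key)
```

Only the clusters returned here would go forward to manual assessment.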
•Manual assessment of 190 clusters
Transformation Impact
•25,539 source GO terms (12/2007 version)
•Pre-processing reduces to 23,478 (8%)
•a=Abstraction, s=StopRemoval, r=Reordering
•Abstraction has most impact: 46% reduction
Abstraction breakdown,a-- clusters
Distribution of cluster size
(Panels: --- transformation; asr transformation)
True Positive clusters
•67 clusters
•317 GO terms
•Obsolete term filter: 7 clusters, 32 terms
•Approximately 77 term rephrasings anticipated
True Positive inconsistencies
•{X Y} ≈ {Y of X} | {Y in X} [45%]
{GTERM GTERM organis symbion}
GO:0052387 -- induction by organism of symbiont apoptosis
GO:0052351 -- induction by organism of systemic acquired resistance in symbiont
GO:0052350 -- induction by organism of induced systemic resistance in symbiont
GO:0052560 -- induction by organism of symbiont immune response
GO:0052399 -- induction by organism of symbiont programmed cell death
GO:0052396 -- induction by organism of symbiont non-apoptotic programmed cell death
{GTERM multice organis}
GO:0010259 -- multicellular organismal aging
GO:0022412 -- reproductive cellular process in multicellular organism
GO:0032504 -- multicellular organism reproduction
GO:0033057 -- reproductive behavior in a multicellular organism
GO:0033555 -- multicellular organismal response to stress
GO:0035264 -- multicellular organism growth
True Positives (2)
•Determiners [16%]
{GTERM forebra}
GO:0021861 -- radial glial cell differentiation in the forebrain
GO:0021846 -- cell proliferation in forebrain
GO:0021872 -- generation of neurons in the forebrain
{GTERM organ}
GO:0031100 -- organ regeneration
GO:0035265 -- organ growth
GO:0010260 -- organ senescence
GO:0001759 -- induction of an organ
True Positives (3)
•Other alternations [16%]
{GTERM selecti site}
GO:0000282 -- cellular bud site selection
GO:0000918 -- selection of site for barrier septum formation
•Conflicting conventions [6%]
{GTERM endothe} (partial listing)
GO:0003100 -- regulation of systemic arterial blood pressure by endothelin
GO:0004962 -- endothelin receptor activity
•Punctuation [3%]
GO:0016653 -- oxidoreductase activity, acting on NADH, heme protein as acceptor
GO:0016658 -- oxidoreductase activity, acting on NADH, flavin as acceptor
GO:0050664 -- oxidoreductase activity, acting on NADH, with oxygen as acceptor
GO:0043247 -- telomere maintenance in response to DNA damage
GO:0042770 -- DNA damage response, signal transduction
True Positives (4)
•“Grab bag”
–Lexical choice
  •“within” vs. “in”
  •“substrate-specific” vs. “substrate-dependent”
–Superfluous words like “other”
False positive breakdown
False positive cluster examples
•Semantic import of stopword [50%]
{CTERM GTERM levels modulat symbion} (partial listing)
GO:0052430 -- modulation by host of symbiont RNA levels
GO:0052018 -- modulation by symbiont of host RNA levels
{CTERM CTERM galacto GTERM}
GO:0033580 -- protein amino acid galactosylation at cell surface
GO:0033582 -- protein amino acid galactosylation in cytosol
GO:0033579 -- protein amino acid galactosylation in endoplasmic reticulum
{callose deposit GTERM}
GO:0052542 -- callose deposition during defense response
GO:0052543 -- callose deposition in cell wall
False positives (2)
•Non-parallel structure [27%]
{CTERM CTERM}
GO:0005204 -- chondroitin sulfate proteoglycan
GO:0006088 -- acetate to acetyl-CoA
GO:0015641 -- lipoprotein toxin
{GTERM GTERM GTERM} (partial listing)
GO:0019896 -- axon transport of mitochondrion
GO:0047496 -- vesicle transport along microtubule
GO:0047497 -- mitochondrion transport along microtubule
GO:0032066 -- nucleolus to nucleoplasm transport
GO:0052067 -- negative regulation by symbiont of entry into host cell via phagocytosis
{GTERM storage}
GO:0001506 -- neurotransmitter biosynthetic process and storage
GO:0000322 -- storage vacuole
False positives (3)
•Stemming [17%]
{regulat GTERM} (partial listing)
GO:0045066 -- regulatory T cell differentiation
GO:0045069 -- regulation of viral genome replication
GO:0045055 -- regulated secretory pathway
GO:0031347 -- regulation of defense response
•Syntactic variation [5%]
{GTERM mainten}
GO:0045216 -- intercellular junction assembly and maintenance
GO:0045217 -- intercellular junction maintenance
GO:0045218 -- zonula adherens maintenance
•Semantic import of word order [5%]
{GTERM CTERM activit} {CTERM GTERM activit}
apoptosis inhibitor activity
gibberellin binding activity
Conclusions
•Used simple term transformations and heuristic search
•Able to reduce set of clusters to be manually evaluated to 190 (for 25k terms)
•Identified 67 TP instances of univocality violations covering 317 GO terms
•Future work
–More specific linguistic alternations
–Improve heuristics for TP search
GO as a lexical semantic resource
• The Gene Ontology represents semantic relationships (is_a, part_of) between biological phrases representing molecular functions/processes
• Utilize the structure of the GO and lexical correspondences to infer relationships at the term level from relationships between phrases
Verspoor, C., C. Joslyn and G. Papcun (2003). "The Gene Ontology as a Source of Lexical Semantic Knowledge for a Biological Natural Language Processing Application". In Proceedings of the SIGIR'03 Workshop on Text Analysis and Search for Bioinformatics.
Inferring Lexical Relations from GO
Parallel rule:
vanillin metabolism isa aldehyde metabolism ⇒ vanillin isa aldehyde
lipoprotein biosynthesis isa lipoprotein metabolism ⇒ biosynthesis isa metabolism
Modifier rule: blocking rule for modifiers
Positive gravitactic behavior isa gravitactic behavior ⇒ Ø
Larval feeding behavior (sensu insecta) isa Larval feeding behavior ⇒ Ø
Insertion rule: right-branching heuristic
adult feeding behavior isa adult behavior ⇒ feeding behavior isa behavior
chemosensory jump behavior isa chemosensory behavior ⇒ jump behavior isa behavior
Verspoor et al. (2003)
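The three rules can be approximated with simple token comparisons. The sketch below is one reading of the slide examples, not the published rule definitions: the modifier rule fires when the parent phrase is a prefix or suffix of the child, the parallel rule when the phrases differ in exactly one position, and the insertion rule strips the shared prefix. The function name is hypothetical.

```python
def infer_relation(child, parent):
    """Given 'child isa parent' phrases, return an inferred (x, y) pair meaning
    'x isa y', or None when a rule blocks the inference or none applies."""
    c, p = child.lower().split(), parent.lower().split()

    # Modifier rule: the child only adds words before or after the parent.
    if len(c) > len(p) and (c[-len(p):] == p or c[:len(p)] == p):
        return None

    # Parallel rule: same length, exactly one differing position.
    if len(c) == len(p):
        diffs = [(a, b) for a, b in zip(c, p) if a != b]
        return diffs[0] if len(diffs) == 1 else None

    # Insertion rule: drop the shared prefix, keep the right-hand remainders.
    if len(c) > len(p):
        i = 0
        while i < len(p) and c[i] == p[i]:
            i += 1
        rest_c, rest_p = c[i:], p[i:]
        if i > 0 and rest_p and rest_c[-len(rest_p):] == rest_p:
            return " ".join(rest_c), " ".join(rest_p)
    return None


parallel = infer_relation("vanillin metabolism", "aldehyde metabolism")
modifier = infer_relation("Positive gravitactic behavior", "gravitactic behavior")
insertion = infer_relation("adult feeding behavior", "adult behavior")
```

Counting how often each inferred pair recurs across the ontology would give tallies of the kind reported for the inferred relations.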
Relations inferred (with counts)
581 biosynthesis isa metabolism
577 catabolism isa metabolism
44 receptor isa binding
38 deoxyribonucleoside isa nucleoside
35 ribonucleoside isa nucleoside
33 permease isa transporter
27 Saccharomyces isa Fungi
22 porter isa transporter
15 oxidation isa metabolism
14 tRNA isa RNA
14 inhibitor isa regulator
13 ribonucleotide isa nucleotide
11 proliferation isa activation
11 differentiation isa activation
11 deoxyribonucleotide isa nucleotide
10 rRNA isa RNA
10 mRNA isa RNA
9 snRNA isa RNA
8 modification isa metabolism
8 methylation isa modification
6,364 unique relations inferred; only 70 already exist in the GO
3,270/6,589 unique labels inferred that do not occur in the GO as terms
Verspoor et al. (2003)
A portion of the induced network
773 trees in the induced hierarchy
• 669 depth 2, 69 depth 3
• max depth 10, “biosynthesis”
Verspoor et al. (2003)
Other uses of ontological structure
•Use of subsumption hierarchy to fill in examples for machine learning
•Establish semantic (functional) similarity
–Graph distance
–Information content
•Inference, reasoning
•Constraints for information extraction (next time …)
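For the information-content bullet, one widely used formulation is Resnik similarity: the information content of a node is the negative log of the fraction of annotations at or below it, and two nodes are as similar as their most informative common ancestor. The slides do not say which measure is meant, so the sketch below is an illustrative choice; the hierarchy and annotation counts are toy data.

```python
import math


def ancestors(parents, node):
    """The node itself plus everything reachable by following parent links."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def information_content(parents, annotation_counts):
    """IC(n) = -log(annotations at or below n / all annotations)."""
    total = sum(annotation_counts.values())
    below = {}
    for node, count in annotation_counts.items():
        for a in ancestors(parents, node):
            below[a] = below.get(a, 0) + count
    return {n: -math.log(c / total) for n, c in below.items()}


def resnik(parents, ic, a, b):
    """Information content of the most informative common ancestor."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return max((ic.get(n, 0.0) for n in common), default=0.0)


# Toy hierarchy: root -> A, B; A -> C, D.
parents = {"A": ["root"], "B": ["root"], "C": ["A"], "D": ["A"]}
ic = information_content(parents, {"C": 1, "D": 1, "B": 2})
sim_siblings = resnik(parents, ic, "C", "D")
sim_distant = resnik(parents, ic, "C", "B")
```

Siblings C and D share the moderately informative ancestor A, whereas C and B share only the root, whose information content is zero.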