What is an Ontology
• An ontology is a set of terms, relationships and definitions that capture the knowledge of a certain domain. (common ontology ≠ common knowledge)
• Terms represent a controlled vocabulary, and define the concepts of a domain.
• Terms are linked by relationships, which constitute a semantic network.
• Ontologies augment natural language annotations and can be more easily processed computationally. (An ontology becomes the language of the domain it describes, for communication, coordination and collaboration.)
Why We Need Ontology in Bioinformatics
• Biologists need knowledge in order to perform their work.
  – Sequence comparison to infer function.
• Biologists need knowledge for communication, but such knowledge may be represented in different ways.
  – Different uses of the term "gene":
    • The coding region of DNA
    • A DNA fragment that can be transcribed and translated into a protein
    • A DNA region of biological interest that has a name and carries a genetic trait or phenotype
The Gene Ontology (GO)
• Provides structured vocabularies for describing gene products in the domain of molecular biology.
• Enables a common understanding across model organisms and between databases.
• Consists of three structurally unlinked hierarchies (molecular function, biological process and cellular component).
• 2 types of relationships between terms:
• is-a: subclass.
• part-of: physical part of, or subprocess of.
Why Gene Ontology?
• Without structured vocabularies, different sources can refer to the same concept using different terms (e.g., cdc54 in yeast is MCM4 in mouse).
• What is a well-known shorthand in one research community is gibberish in another. Contributions by one research community may not be recognized by others.
• Without coordination, research work may be duplicated.
• The goal of the Gene Ontology Consortium is to produce a controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing.
Three GO Hierarchies
• Molecular function: elemental activity/task (what)
(e.g., DNA-binding, polymerase, transcription factor) (what a gene does at the biochemical level)
• Biological process: goal or objective (why)
(e.g., mitosis, DNA replication, cell cycle control) (A broad biological perspective – not currently a pathway)
• Cellular component: location within cellular structures and macromolecular complex (where)
(e.g., nucleus, ribosome, pre-replication complex)
(Each GO hierarchy is structured as a directed acyclic graph (DAG): a child term may have many parent terms.)
(Gene Ontology information can be accessed at http://www.geneontology.org/)
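A minimal sketch of how such a DAG could be held in memory, assuming edges are stored as (child, relationship, parent) triples. The term names are hypothetical placeholders, not real GO terms; the point is only that a child may have several parents and that both is-a and part-of edges can be followed upward.

```python
from collections import defaultdict

# Hypothetical toy terms (not real GO ids): (child, relationship, parent).
EDGES = [
    ("term_B", "is_a", "term_A"),
    ("term_C", "is_a", "term_A"),
    ("term_D", "is_a", "term_B"),
    ("term_D", "part_of", "term_C"),  # term_D has two parents
]

def build_parent_index(edges):
    """Map each child to its list of (relationship, parent) pairs."""
    parents = defaultdict(list)
    for child, relationship, parent in edges:
        parents[child].append((relationship, parent))
    return parents

def ancestors(term, parents):
    """Return every term reachable by following is_a/part_of edges upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

parents = build_parent_index(EDGES)
print(sorted(ancestors("term_D", parents)))
```

Because the structure is a DAG rather than a tree, the `seen` set is what keeps a shared ancestor (here `term_A`) from being visited twice.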
Example: Gene Ontology Hierarchy
[Figure: part of the biological process hierarchy, drawn as a graph of GO terms whose edges are labeled i (is-a) or P (part-of). Terms shown: Biological process (GO:0008150), Behavior (GO:0007610), Cellular process (GO:0009987), Development (GO:0007275), Physiological (GO:0007582), Cell death (GO:0008219), Cell aging (GO:0007569), Programmed (GO:0012501), Apoptosis (GO:0006915), Induction (GO:0012502), Autophagic cell death (GO:0048102), HS response (GO:0009626), Communication (GO:0007154), Cell growth (GO:0008151).]
Gene Annotation Using GO Terms
• Association of GO terms with gene products based on evidence from literature reference or computational analysis.
• The creation of GO and the association of GO terms with gene products (gene annotation) are two independent operations.
• A gene can be associated with one or more GO terms (gene categories), and one category normally has many genes (many-to-many relationship between genes and GO terms)
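The many-to-many relationship can be sketched as two lookup tables built from the same list of (gene, GO term) pairs. Only the pair AAC1 / GO:0005743 is taken from an example in this material; the other gene and term names are hypothetical placeholders.

```python
from collections import defaultdict

# (gene, GO term) pairs. Only ("AAC1", "GO:0005743") comes from this
# material; GENE_2, GENE_3 and TERM_X are hypothetical placeholders.
PAIRS = [
    ("AAC1", "GO:0005743"),
    ("AAC1", "TERM_X"),
    ("GENE_2", "GO:0005743"),
    ("GENE_3", "TERM_X"),
]

gene_to_terms = defaultdict(set)  # one gene -> many GO terms
term_to_genes = defaultdict(set)  # one GO term -> many genes

for gene, term in PAIRS:
    gene_to_terms[gene].add(term)
    term_to_genes[term].add(gene)

print(sorted(gene_to_terms["AAC1"]))
print(sorted(term_to_genes["GO:0005743"]))
```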
[Figure labels: mouse, fly, yeast]
Gene Product Associations to an Ontology
• Association record: GO ID, DB ID, Evidence code, Reference citation, NOT
• Term record: ID, Term, Definition, Ontology, Synonyms
• Relationship record: Is-a | Part-of, Node1 ID, Node2 ID
Example: Part of Molecular Function
Example: Part of Biological Process
Example: Part of Cellular Component
Genes of a Biological Process Tend to Be Co-Regulated
[Figure: table with columns Gene Names and Biological Process]
Use Gene Ontology (GO) to Annotate Genes
• GO URL: http://www.geneontology.org/
• Two concepts:
• Gene Ontology: Provides structured vocabularies for describing gene products in the domain of molecular biology (all species share the same gene ontology)
• Annotations: Association of GO terms with gene products based on evidence from literature reference or computational analysis (each species has a separate annotation file)
The Gene Ontology (GO)
• GO file: http://www.geneontology.org/ontology/gene_ontology.obo
• An example of a GO term:
[Term]
id: GO:0000001 (A unique id for the GO term)
name: mitochondrion inheritance (The name of the GO term)
namespace: biological_process (The GO hierarchy the term belongs to)
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824, PMID:11389764, SGD:mcc] (A detailed description of the GO term)
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
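A minimal sketch of reading one such [Term] stanza, assuming every line has the form "tag: value" and that the text after "!" on an is_a line is only a human-readable comment. The function name `parse_term` is hypothetical, and the sketch ignores other stanza types and escape rules that a full ontology file may contain.

```python
STANZA = """[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824, PMID:11389764, SGD:mcc]
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
"""

def parse_term(stanza):
    """Parse one [Term] stanza into a dict; is_a parents are collected in a list."""
    term = {"is_a": []}
    for line in stanza.splitlines():
        line = line.strip()
        if not line or line == "[Term]":
            continue
        # Split on the first ": " only, so colons inside the value survive.
        tag, _, value = line.partition(": ")
        if tag == "is_a":
            # Keep the parent id, drop the trailing "! name" comment.
            term["is_a"].append(value.split("!")[0].strip())
        else:
            term[tag] = value
    return term

term = parse_term(STANZA)
print(term["id"], term["name"], term["namespace"], term["is_a"])
```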
Gene Annotation Using GO Terms
• http://www.geneontology.org/GO.current.annotations.shtml
• Select the annotation file for a particular species
• An example of an annotation entry for yeast
SGD  S000004660  AAC1  GO:0005743  SGD_REF:S000050955|PMID:2167309  TAS  C  ADP/ATP translocator  YMR056C  gene  taxon:4932
• “AAC1” is the gene name
• “GO:0005743” is the GO id; we can link it to the corresponding item in the ontology file
• “SGD_REF:S000050955|PMID:2167309” is where this annotation comes from
• “C” means this annotation belongs to the “cellular component” namespace
• “ADP/ATP translocator” is a brief description of the gene product
• “YMR056C” is another name for this gene
• “taxon:4932” means this is a yeast gene
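A minimal sketch of splitting such an entry into named fields. It assumes the annotation file is tab-delimited and that blank columns (for example, a qualifier such as NOT) sit between the visible values; the column names and the helper `parse_annotation` are assumptions for illustration and should be checked against the header of the file actually downloaded.

```python
# Column names are an assumption about the tab-delimited layout; only the
# columns that appear in the example entry are named here.
COLUMNS = [
    "db", "db_object_id", "symbol", "qualifier", "go_id", "reference",
    "evidence_code", "with_from", "aspect", "name", "synonym", "type", "taxon",
]

# One-letter aspect codes for the three GO hierarchies.
ASPECTS = {
    "F": "molecular_function",
    "P": "biological_process",
    "C": "cellular_component",
}

# The example entry, rebuilt with tabs; "" stands for a blank column.
LINE = "\t".join([
    "SGD", "S000004660", "AAC1", "", "GO:0005743",
    "SGD_REF:S000050955|PMID:2167309", "TAS", "", "C",
    "ADP/ATP translocator", "YMR056C", "gene", "taxon:4932",
])

def parse_annotation(line):
    """Turn one annotation line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    entry = dict(zip(COLUMNS, fields))
    entry["reference"] = entry["reference"].split("|")  # several sources allowed
    entry["namespace"] = ASPECTS[entry["aspect"]]
    return entry

entry = parse_annotation(LINE)
print(entry["symbol"], entry["go_id"], entry["namespace"], entry["taxon"])
```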
Gene Annotation Using GO Terms
Given a list of genes L from a specific species Sj:
1) go to http://www.geneontology.org/GO.current.annotations.shtml
2) select and download the annotation file Fj for Sj
For each gene Gi in list L:
3) find the annotation entry Ek for Gi in Fj
4) find the GO term id from entry Ek
5) go to http://www.geneontology.org/ontology/gene_ontology.obo
6) find the GO term in the ontology file; the GO term provides a more detailed annotation for this gene
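The per-gene loop above can be sketched as follows, assuming the downloads in steps 1, 2 and 5 are already done and both files have been parsed into dictionaries. The two small tables stand in for those parsed files; the function name `annotate` is hypothetical, and the term name given for GO:0005743 is taken from the GO ontology rather than from this material.

```python
# Stand-ins for the parsed annotation file Fj and the parsed ontology file.
ANNOTATIONS = {  # gene name -> list of GO ids found in Fj
    "AAC1": ["GO:0005743"],
}
ONTOLOGY = {  # GO id -> term record from the ontology file
    "GO:0005743": {
        "name": "mitochondrial inner membrane",
        "namespace": "cellular_component",
    },
}

def annotate(genes, annotations, ontology):
    """Return {gene: [term record, ...]} for every gene in the list."""
    result = {}
    for gene in genes:                      # for each gene Gi in L
        go_ids = annotations.get(gene, [])  # steps 3-4: entries and their GO ids
        result[gene] = [                    # step 6: look each id up in the ontology
            dict(ontology[go_id], id=go_id)
            for go_id in go_ids
            if go_id in ontology
        ]
    return result

annotated = annotate(["AAC1", "UNKNOWN_GENE"], ANNOTATIONS, ONTOLOGY)
for gene, terms in annotated.items():
    print(gene, [t["name"] for t in terms])
```

A gene with no annotation entry simply maps to an empty list, which keeps the loop from failing on names missing from Fj.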
Use of GO to Annotate Genes
Notations
Total number of genes in the data set : N
Total number of genes assigned to term T: M
Number of genes in the list: n
Number of genes in the list and assigned to term T: m
Problem: Given a list of n genes, are they significantly associated with a specific GO term?
Solution: Calculate the p-Value.
How to Assess Overrepresentation of a GO Term?
Genes on an array:
Total number of genes (N): 2,285
Number of genes – cell cycle (M): 161
Genes in a cluster:
Number of genes in the cluster (n): 147
Number of genes – cell cycle (m): 25
Is the GO term (i.e., cell cycle) significantly overrepresented in the cluster?
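As a rough first look before any significance test, the counts above can be compared with what a random cluster of the same size would be expected to contain. This is only an illustrative sketch; the variable names are hypothetical, and the "fold enrichment" ratio is an added convenience, not part of the original slides.

```python
N, M = 2285, 161  # genes on the array; of those, annotated to cell cycle
n, m = 147, 25    # genes in the cluster; of those, annotated to cell cycle

background_fraction = M / N      # share of cell cycle genes on the array
cluster_fraction = m / n         # share of cell cycle genes in the cluster
expected_in_cluster = n * M / N  # count expected if the cluster were random
fold_enrichment = m / expected_in_cluster

print(f"background fraction: {background_fraction:.3f}")
print(f"cluster fraction:    {cluster_fraction:.3f}")
print(f"expected in cluster: {expected_in_cluster:.2f}")
print(f"fold enrichment:     {fold_enrichment:.2f}")
```

Roughly 10 cell cycle genes would be expected by chance against the 25 observed; whether that difference is significant is what the p-value addresses.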
Hyper-geometric Distribution
Given that M of the N genes in the data set are associated with term T, if n genes are drawn at random from the data set, what is the probability that m of the n selected genes will be associated with T?
\[ \Pr(m \mid N, M, n) = \frac{\binom{M}{m}\binom{N-M}{n-m}}{\binom{N}{n}} \]
P-Value
Based on the hypergeometric distribution, the probability of having m or more genes associated with T in a random list of n genes can be calculated by summing the probabilities of a random list of n genes having m, m+1, …, min(n, M) genes associated with T. So the p-value of over-representation is as follows:
\[ p = \sum_{i=m}^{\min(n,M)} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}} \]
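A minimal sketch of this calculation using exact binomial coefficients from the Python standard library, applied to the cell cycle counts given earlier (N = 2,285, M = 161, n = 147, m = 25). The function names are hypothetical; a statistics package could be used instead.

```python
from math import comb

def hypergeom_pmf(i, N, M, n):
    """Probability that exactly i of n randomly drawn genes carry term T,
    when M of the N genes in the data set carry T."""
    return comb(M, i) * comb(N - M, n - i) / comb(N, n)

def overrepresentation_p_value(m, N, M, n):
    """Probability of seeing m or more genes with term T in a random list of n."""
    return sum(hypergeom_pmf(i, N, M, n) for i in range(m, min(n, M) + 1))

# Cell cycle example: 161 of 2,285 genes on the array, 25 of 147 in the cluster.
p = overrepresentation_p_value(25, 2285, 161, 147)
print(f"p-value: {p:.3g}")
```

The binomial coefficients are computed as exact integers and divided only at the end, which avoids overflow in the intermediate factorials for data sets of this size.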
MAPPFinder
• A tool for mapping gene expression data to the GO hierarchies.
• Part of the free software package GenMAPP.
• Available at http://www.genmapp.org/.
(Doniger et al., 2003)
MAPPFinder Sample Output
(Doniger et al., 2003)
GoMiner
(Zeeberg et al., 2003)
• A client-server application using Java (data on the server side).
• Available at http://discover.nci.nih.gov/gominer/.
Onto-Express
• A web application for GO-based microarray data analysis (http://vortex.cs.wayne.edu/Projects.html).
• The input to Onto-Express is a list of Affymetrix probe IDs, GenBank sequence accessions or UniGene cluster IDs.
• Part of the integrated Onto-Tools, including:
  – Onto-Compare: compares commercial arrays.
  – Onto-Design: helps with array design (probe selection).
  – Onto-Translate: provides mappings between different IDs.
[Figure: table with columns p, GO and # genes (genes linked to poor breast cancer outcome)]