what is an ontology


Upload: maxwell-jordan

Post on 31-Dec-2015


Page 1: What is an Ontology

What is an Ontology

• An ontology is a set of terms, relationships and definitions that capture the knowledge of a certain domain. (common ontology ≠ common knowledge)

• Terms represent a controlled vocabulary, and define the concepts of a domain.

• Terms are linked by relationships, which constitute a semantic network.

• Ontologies augment natural language annotations and can be more easily processed computationally. (An ontology becomes the language of the domain it describes, used for communication, coordination and collaboration.)

Page 2: What is an Ontology

Why We Need Ontology in Bioinformatics

• Biologists need knowledge in order to perform their work.

– Example: sequence comparison to infer function.

• Biologists need knowledge for communication, but such knowledge may be represented in different ways. For example, “gene” is used to mean:

– The coding region of DNA

– A DNA fragment that can be transcribed and translated into a protein

– A named DNA region of biological interest that carries a genetic trait or phenotype

Page 3: What is an Ontology

The Gene Ontology (GO)

• Provides structured vocabularies for describing gene products in the domain of molecular biology.

• Enables a common understanding across model organisms and between databases.

• Consists of three structurally unlinked hierarchies (molecular function, biological process and cellular component).

• 2 types of relationships between terms:

• is-a: subclass.

• part-of: physical part of, or subprocess of.

Page 4: What is an Ontology

Why Gene Ontology?

• Without structured vocabularies, different sources can refer to the same concept using different terms (e.g., cdc54 in yeast is MCM4 in mouse).

• What is a well-known shorthand in one research community is gibberish in another. Contributions by one research community may not be recognized by others.

• Without coordination, research work may be duplicated.

• The goal of the Gene Ontology Consortium is to produce a controlled vocabulary that can be applied to all eukaryotes even as knowledge of gene and protein roles in cells is accumulating and changing.

Page 5: What is an Ontology

Three GO Hierarchies

• Molecular function: elemental activity/task (what): what a gene does at the biochemical level

(e.g., DNA-binding, polymerase, transcription factor)

• Biological process: goal or objective (why): a broad biological perspective, not currently a pathway

(e.g., mitosis, DNA replication, cell cycle control)

• Cellular component: location within cellular structures and macromolecular complexes (where)

(e.g., nucleus, ribosome, pre-replication complex)

Gene Ontology information can be accessed at http://www.geneontology.org/

Each GO hierarchy has a DAG (directed acyclic graph) structure: a child term may have more than one parent term.
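The DAG property (a child term may have several parents) can be sketched with a small parent map. This is only an illustration: the two is_a parents of GO:0000001 are taken from the example term shown in the ontology file slide, while the `TOY:root` node and the helper name `ancestors` are hypothetical.

```python
from collections import deque

# Each term maps to a list of (relationship, parent) edges.
PARENTS = {
    "GO:0000001": [("is_a", "GO:0048308"), ("is_a", "GO:0048311")],
    # Hypothetical placeholder root, used only so the walk has a second level.
    "GO:0048308": [("is_a", "TOY:root")],
    "GO:0048311": [("is_a", "TOY:root")],
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable by following is_a / part_of edges upward."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:  # a shared parent is visited only once
                seen.add(parent)
                queue.append(parent)
    return seen

print(sorted(ancestors("GO:0000001")))
```

Because the structure is a DAG rather than a tree, the walk keeps a `seen` set so that a parent reached along two different paths is counted once.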

Page 6: What is an Ontology

Example: Gene Ontology Hierarchy

[Figure: part of the biological process hierarchy, drawn as a graph whose edges are labeled i (is-a) or P (part-of). Terms shown in the figure:]

Biological process (GO:0008150)

Behavior (GO:0007610)

Cellular process (GO:0009987)

Development (GO:0007275)

Physiological (GO:0007582)

Cell death (GO:0008219)

Cell aging (GO:0007569)

Programmed (GO:0012501)

Apoptosis (GO:0006915)

Induction (GO:0012502)

Autophagic cell death (GO:0048102)

HS response (GO:0009626)

Communication (GO:0007154)

Cell growth (GO:0008151)

Page 7: What is an Ontology

[Figure legend: i = is-a, P = part-of]

Page 8: What is an Ontology

Gene Annotation Using GO Terms

• Association of GO terms with gene products based on evidence from literature reference or computational analysis.

• The creation of GO and the association of GO terms with gene products (gene annotation) are two independent operations.

• A gene can be associated with one or more GO terms (gene categories), and one category normally has many genes (many-to-many relationship between genes and GO terms)
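A many-to-many relationship is usually indexed in both directions. The sketch below assumes toy data: the gene names and term ids are placeholders, not real annotations, and `index_associations` is a hypothetical helper.

```python
from collections import defaultdict

# Toy (gene, term) pairs; placeholders only, not real GO annotations.
ASSOCIATIONS = [
    ("geneA", "TERM:1"),
    ("geneA", "TERM:2"),
    ("geneB", "TERM:1"),
    ("geneC", "TERM:2"),
]

def index_associations(pairs):
    """Build both lookups: gene -> set of terms, and term -> set of genes."""
    gene_to_terms = defaultdict(set)
    term_to_genes = defaultdict(set)
    for gene, term in pairs:
        gene_to_terms[gene].add(term)
        term_to_genes[term].add(gene)
    return dict(gene_to_terms), dict(term_to_genes)

gene_to_terms, term_to_genes = index_associations(ASSOCIATIONS)
print(gene_to_terms["geneA"])   # one gene, several terms
print(term_to_genes["TERM:1"])  # one term, several genes
```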

Page 9: What is an Ontology

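Gene Product Associations to an Ontology

[Figure: gene product annotations from mouse, fly and yeast linked to the ontology. Fields shown for the three record types:]

• GO ID, DB ID, Evidence code, Reference Citation, NOT

• ID, Term, Definition, Ontology, Synonyms

• Is-a | Part-of, Node1 ID, Node2 ID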

Page 10: What is an Ontology

Example: Part of Molecular Function

Page 11: What is an Ontology

Example: Part of Biological Process

Page 12: What is an Ontology

Example: Part of Cellular Component

Page 13: What is an Ontology

Genes of a Biological Process Tend to Be Co-Regulated

[Figure: table with columns “Gene Names” and “Biological Process”]

Page 14: What is an Ontology

Use Gene Ontology (GO) to Annotate Genes

• GO URL: http://www.geneontology.org/

• Two concepts:

• Gene Ontology: Provides structured vocabularies for describing gene products in the domain of molecular biology (all species share the same gene ontology)

• Annotations: Association of GO terms with gene products based on evidence from literature reference or computational analysis (each species has a separate annotation file)

Page 15: What is an Ontology

The Gene Ontology (GO)

• GO file: http://www.geneontology.org/ontology/gene_ontology.obo

• An example of a GO term:

[Term]

id: GO:0000001 (A unique id for the GO term)

name: mitochondrion inheritance (The name of the GO term)

namespace: biological_process (see next slide)

def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824, PMID:11389764, SGD:mcc] (A detailed description of the GO term)

is_a: GO:0048308 ! organelle inheritance

is_a: GO:0048311 ! mitochondrion distribution
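A stanza like this one can be read with a few lines of code. The following is a minimal sketch, assuming the stanza text exactly as shown (without the explanatory notes in parentheses); `parse_obo_terms` is a hypothetical helper, not part of any GO tool, and it ignores every tag other than id, name, namespace and is_a.

```python
OBO_TEXT = """\
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
"""

def parse_obo_terms(text):
    """Parse [Term] stanzas into {id: {"id", "name", "namespace", "is_a"}}."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = {"is_a": []} if line == "[Term]" else None
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            if key == "id":
                current["id"] = value
                terms[value] = current
            elif key in ("name", "namespace"):
                current[key] = value
            elif key == "is_a":
                # Text after "!" is a human-readable comment.
                current["is_a"].append(value.split("!")[0].strip())
    return terms

term = parse_obo_terms(OBO_TEXT)["GO:0000001"]
print(term["name"], term["namespace"], term["is_a"])
```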

Page 16: What is an Ontology

Gene Annotation Using GO Terms

• http://www.geneontology.org/GO.current.annotations.shtml

• Select the annotation file for a particular species

• An example of an annotation entry for yeast

SGD  S000004660  AAC1  GO:0005743  SGD_REF:S000050955|PMID:2167309  TAS  C  ADP/ATP translocator  YMR056C  gene  taxon:4932

“AAC1” is the gene name

“GO:0005743” is the GO id; we can link it to the corresponding item in the ontology file

“SGD_REF:S000050955|PMID:2167309” is where this annotation comes from

“TAS” is the evidence code (Traceable Author Statement)

“C” means this annotation belongs to the “cellular component” namespace

“ADP/ATP translocator” is a brief description of this annotation

“YMR056C” is another name for this gene

“taxon:4932” means this is a yeast gene
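The entry can be split into named fields programmatically. This sketch assumes the annotation file is tab-separated and that the slide shows the line with whitespace collapsed, so two empty columns (a qualifier and a with/from column) are restored here; the field names in `FIELDS` and the helper `parse_annotation` are illustrative choices, not an official schema.

```python
# Assumed column order for a tab-separated GO annotation line (illustrative names).
FIELDS = [
    "db", "db_object_id", "symbol", "qualifier", "go_id", "references",
    "evidence", "with_from", "aspect", "name", "synonyms", "object_type", "taxon",
]
ASPECTS = {"F": "molecular_function", "P": "biological_process", "C": "cellular_component"}

# The slide's yeast entry, with the two assumed empty columns put back.
LINE = "\t".join([
    "SGD", "S000004660", "AAC1", "", "GO:0005743",
    "SGD_REF:S000050955|PMID:2167309", "TAS", "", "C",
    "ADP/ATP translocator", "YMR056C", "gene", "taxon:4932",
])

def parse_annotation(line):
    """Split one annotation line into a dict keyed by FIELDS."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(FIELDS, values))
    record["references"] = record["references"].split("|")  # several sources allowed
    record["namespace"] = ASPECTS.get(record["aspect"])     # expand the one-letter code
    return record

rec = parse_annotation(LINE)
print(rec["symbol"], rec["go_id"], rec["evidence"], rec["namespace"])
```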

Page 17: What is an Ontology

Gene Annotation Using GO Terms

Given a list of genes L from a specific species Sj:

1) Go to http://www.geneontology.org/GO.current.annotations.shtml

2) Select and download the annotation file Fj for Sj

For each gene Gi in list L:

3) Find the annotation entry Ek for Gi in Fj

4) Find the GO term id in entry Ek

5) Go to http://www.geneontology.org/ontology/gene_ontology.obo

6) Find the GO term in the ontology file; the GO term provides a more detailed annotation for this gene
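Steps 3 to 6 amount to two dictionary lookups per gene. The sketch below assumes the annotation file and the ontology file have already been downloaded and loaded into memory; the two small dictionaries stand in for them, and `annotate` is a hypothetical helper. The term name given for GO:0005743 is the one listed in GO for that id.

```python
# Stand-in for the annotation file Fj: gene name -> GO ids.
ANNOTATIONS = {"AAC1": ["GO:0005743"]}

# Stand-in for the ontology file: GO id -> term record.
ONTOLOGY = {
    "GO:0005743": {"name": "mitochondrial inner membrane",
                   "namespace": "cellular_component"},
}

def annotate(genes, annotations, ontology):
    """Map each gene to a list of (GO id, term name) pairs."""
    result = {}
    for gene in genes:
        go_ids = annotations.get(gene, [])  # steps 3-4: entry and its GO ids
        result[gene] = [
            (go_id, ontology.get(go_id, {}).get("name"))  # steps 5-6: look up the term
            for go_id in go_ids
        ]
    return result

print(annotate(["AAC1", "UNKNOWN"], ANNOTATIONS, ONTOLOGY))
```

Genes without an entry in the annotation file simply map to an empty list, which keeps unannotated genes visible in the result.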

Page 18: What is an Ontology

Use of GO to Annotate Genes

Notations

Total number of genes in the data set: N

Total number of genes assigned to term T: M

Number of genes in the list: n

Number of genes in the list and assigned to term T: m

Problem: Given a list of n genes, are they significantly associated with a specific GO term?

Solution: Calculate the p-Value.

Page 19: What is an Ontology

How to Assess Overrepresentation of a GO Term?

Genes on an array:

Total number of genes (N): 2,285

Number of genes – cell cycle (M): 161

Genes in a cluster:

Number of genes in the cluster (n): 147

Number of genes – cell cycle (m): 25

Is the GO term (i.e., cell cycle) significantly overrepresented in the cluster?
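As a rough orientation before any formal test, and assuming the cluster were a random draw from the array, the expected number of cell cycle genes in the cluster would be:

```latex
E[m] = n \cdot \frac{M}{N} = 147 \times \frac{161}{2285} \approx 10.4
```

The observed count of 25 is about 2.4 times this expectation; whether that excess is significant is what the p-value is meant to answer.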

Page 20: What is an Ontology

Hypergeometric Distribution

Suppose M of the N genes in the data set are associated with term T. If n genes are drawn at random (without replacement) from the data set, what is the probability that exactly m of them are associated with T?

$$\Pr(m \mid N, M, n) = \frac{\binom{M}{m}\binom{N-M}{n-m}}{\binom{N}{n}}$$
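The probability Pr(m | N, M, n) on this slide can be evaluated directly with binomial coefficients. A minimal sketch using only the Python standard library; the function name `hypergeom_pmf` and the small numbers in the usage line are illustrative.

```python
from math import comb

def hypergeom_pmf(m, N, M, n):
    """Pr(m | N, M, n): probability that exactly m of n drawn genes carry term T."""
    # Outside the feasible range the probability is zero.
    if m < max(0, n - (N - M)) or m > min(n, M):
        return 0.0
    return comb(M, m) * comb(N - M, n - m) / comb(N, n)

# Toy check: 10 genes, 4 carry T, draw 3; chance that exactly 2 carry T.
print(hypergeom_pmf(2, N=10, M=4, n=3))
```

Python integers are arbitrary precision, so the binomial coefficients are exact and only the final division is done in floating point.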

Page 21: What is an Ontology

P-Value

Based on the hypergeometric distribution, the probability that a random list of n genes contains m or more genes associated with T is obtained by summing the probabilities of the list containing m, m+1, …, min(n, M) such genes. The p-value of over-representation is therefore:

$$p = \sum_{i=m}^{\min(n, M)} \frac{\binom{M}{i}\binom{N-M}{n-i}}{\binom{N}{n}}$$
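The over-representation p-value is the upper tail of the hypergeometric distribution, summed from i = m to min(n, M). The sketch below applies it to the cell cycle numbers from the earlier slide (N = 2,285, M = 161, n = 147, m = 25); the function name `enrichment_p_value` is hypothetical and only the standard library is used.

```python
from math import comb

def enrichment_p_value(m, N, M, n):
    """Upper-tail hypergeometric p-value: Pr(at least m of the n genes carry T)."""
    upper = min(n, M)
    # Sum exact integer counts first, then divide once at the end.
    favourable = sum(comb(M, i) * comb(N - M, n - i) for i in range(m, upper + 1))
    return favourable / comb(N, n)

# Cell cycle example from the slides.
p = enrichment_p_value(m=25, N=2285, M=161, n=147)
print(f"p = {p:.3g}")
```

A small p indicates that finding 25 or more cell cycle genes in a random cluster of 147 would be unlikely. When many GO terms are tested at once, a multiple-testing correction would normally be applied as well, which this sketch does not do.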

Page 22: What is an Ontology

MAPPFinder

• A tool for mapping gene expression data to the GO hierarchies.

• Part of the free software package GenMAPP.

• Available at http://www.genmapp.org/.

(Doniger et al., 2003)

Page 23: What is an Ontology

MAPPFinder Sample Output

(Doniger et al., 2003)

Page 24: What is an Ontology

GoMiner

(Zeeberg et al., 2003)

• A client-server application using Java (data on the server side).

• Available at http://discover.nci.nih.gov/gominer/.

Page 25: What is an Ontology

Onto-Express

• A web application for GO-based microarray data analysis (http://vortex.cs.wayne.edu/Projects.html).

• The input to Onto-Express is a list of Affymetrix probe IDs, GenBank sequence accessions or UniGene cluster IDs.

• Part of the integrated Onto-Tools, including:

– Onto-Compare: compare commercial arrays.

– Onto-Design: help array design (probe selection).

– Onto-Translate: provide mapping of different IDs.

[Table columns: p, GO, # genes (genes linked to poor breast cancer outcome)]