The Gene Ontology project and its application to
fission yeast functional genomics data
Valerie Wood
• Introduction to the Gene Ontology (GO) project
• Data mining the fission yeast genome data
• What is GO? (requirement, implementation)
• How does it work? (annotation and ontology development)
• What can I use it for? (applications)
• Tools for using GO for data analysis
• How can I use it? Practical exercises
Gene Ontology Why?
Traditional analysis
Gene 1
mRNA export, protein phosphorylation, transcription, mitotic cell cycle…
Gene 2
mRNA export, DNA recombination, RNA elongation (pol II)…
•requires literature searching
•time-consuming
•gene by gene basis
Not scalable!
Gene 3
mRNA export, transcription (pol II)…
Gene 4
mRNA export, transcription, polyadenylation…
Gene 5
mRNA export, RNA elongation…
Gene 6
mRNA export, rRNA transcription, DNA topological change…
Gene 5000
cell cycle, chromosome segregation, kinetochore assembly, protein localization…
http://www.teamtechnology.co.uk/f-scientist.jpg
Help! The problem gets bigger and bigger and bigger!
The literature corpus
What is the size of the ‘annotation problem’?
Fission yeast + pombe gives 8170 results
Including cell cycle gives 3467
Including DNA repair gives 555
How will we ever extract all of this information?
Grouping by process
mRNA export: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5
transcription: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5…
protein phosphorylation: Gene 1, Gene 7, Gene 10…
cell wall organization and biogenesis: Gene 10, Gene 15, Gene 18…
cell cycle: Gene 1, Gene 7, Gene 8…
[Figure: clustered gene expression tree (gene list: all genes, 14010) comparing attacked and control samples over time. Cluster labels: puparial adhesion, molting cycle, hemocyanin; defense response, immune response, response to stimulus, Toll regulated genes, JAK-STAT regulated genes; immune response, Toll regulated genes; amino acid catabolism, lipid metabolism; peptidase activity, protein catabolism, immune response.]
Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.
GO can be used to spot patterns among the thousands of genes typically returned by functional genomics experiments
A controlled vocabulary
GO is also necessary for handling different terminology used between and within scientific communities:
• The same phrase is used to describe different ‘entities’
• Different phrases have the same or related meanings
Different phrases, one term:
late endosome to vacuole transport
multivesicular body sorting
MVB sorting
→ late endosome to vacuole transport ; GO:0045324
Same phrase, different entities:
Bud initiation?
tooth bud initiation
cellular bud initiation
flower bud initiation
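The two vocabulary problems can be sketched in a few lines of Python. This is an illustration only: the lookup tables are hypothetical hand-built structures, not a GO Consortium API. The three phrases and GO:0045324 come from the slide; the ambiguous "bud initiation" entries carry term names only.

```python
# Hypothetical synonym table: different phrases resolve to one primary GO term.
SYNONYM_TO_TERM = {
    "late endosome to vacuole transport": "GO:0045324",
    "multivesicular body sorting": "GO:0045324",
    "mvb sorting": "GO:0045324",
}

# An ambiguous phrase corresponds to several distinct, more specific terms.
AMBIGUOUS = {
    "bud initiation": [
        "tooth bud initiation",
        "cellular bud initiation",
        "flower bud initiation",
    ],
}


def resolve(phrase):
    """Return a GO ID for a known synonym, the candidate term names for an
    ambiguous phrase, or None when the phrase is not in either table."""
    key = phrase.strip().lower()
    if key in SYNONYM_TO_TERM:
        return SYNONYM_TO_TERM[key]
    return AMBIGUOUS.get(key)
```

Calling `resolve("MVB sorting")` returns the shared identifier, while `resolve("Bud initiation")` returns the three candidate names a curator would have to choose between.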
http://www.geneontology.org/
So what is GO?
• GO provides a “controlled vocabulary” for biological knowledge that can be interpreted identically both within and between genomes
• Species independent, therefore enabling cross-species comparisons
• Provides a way to capture and represent biological knowledge in a computable form
Gene Ontology Content and structure
What is Ontology?
• Dictionary: A branch of metaphysics concerned with the nature and relations of being.
• In philosophy, the most fundamental branch of metaphysics. It studies being or existence as well as the basic categories thereof—trying to find out what entities and what types of entities exist. – Wikipedia
1606 1700s
So what does that mean?
From a practical view, an ontology is the representation of something we know about. “Ontologies” consist of a representation of things that are detectable or directly observable, and the relationships between those things.
Ontologies provide controlled, consistent vocabularies to describe concepts and relationships, thereby enabling knowledge sharing (Gruber 1993).
Ontology
Includes:
1. A vocabulary of terms (names for concepts)
2. Definitions
3. Defined logical relationships to each other
What information might we want to capture about a gene product?
• What does the gene product do?
• Where and when does it act?
• Why does it perform these activities?
• GO is divided into three parts:
molecular function
cellular component
biological process
Cellular Component
• Where a gene product acts (location or complex)
Images from http://microscopy.fsu.edu
Molecular Function
• What a gene product does (activity)
glucose-6-phosphate isomerase activity
insulin binding
insulin receptor activity
drug transporter activity
Biological Process
Broad objective or goal
cell division
transcription
gluconeogenesis
Analogy: Gene Product = hammer
Function (what)            Process (why)
Drive nail (into wood)     Carpentry
Drive stake (into soil)    Gardening
Smash roach                Pest Control
Ontology Structure
• The Gene Ontology is structured as a directed acyclic graph (DAG)
• A DAG is similar to a hierarchy except terms can have more than one parent
• Terms can have zero, one or more children
• Terms are linked by two relationships
– is-a
– part-of
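A minimal sketch of how such a graph might be held in memory, assuming a plain dictionary from each term to its (relationship, parent) pairs. The three cellular component terms and the relationship types assigned to them are chosen for illustration; the function names are hypothetical.

```python
# Each term lists its parents together with the relationship type.
# Illustrative fragment: one term with two parents, reached by different
# relationship types.
PARENTS = {
    "chromosome": [],
    "nucleus": [],
    "nuclear chromosome": [("is-a", "chromosome"), ("part-of", "nucleus")],
}


def children(term):
    """Terms that name `term` as a parent (zero, one or more)."""
    return sorted(
        child
        for child, links in PARENTS.items()
        if any(parent == term for _, parent in links)
    )


def is_strict_hierarchy(parents):
    """True only if no term has more than one parent."""
    return all(len(links) <= 1 for links in parents.values())
```

Here `is_strict_hierarchy(PARENTS)` is false because "nuclear chromosome" has two parents, which is exactly what distinguishes a DAG from a simple hierarchy.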
Parent-Child Relationships
Hierarchy
One-to-many parental relationship
Each child has only one parent
DAG: Directed Acyclic Graph
Many-to-many parental relationship
Each child may have one or more parents
Ontology Structure
[Figure: example DAG with the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane, linked by is-a and part-of relationships]
Ontology structure
• This allows the modelling of biology more realistically than a hierarchy
Ontology structure
An important feature of GO is that broader parents give rise to more specific children. When a gene is annotated to a term, it is automatically annotated to all of its parent terms.
This allows curators to assign terms at different levels of granularity, depending on what is known or can be inferred.
gene A
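The automatic annotation to parent terms can be sketched as a graph walk. This is an illustrative sketch, not the GO Consortium's implementation: the child-to-parents dictionary is a hypothetical fragment and the function names are invented for the example.

```python
def ancestors(term, parents):
    """All terms reachable by following parent links upward from `term`."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical fragment of the process ontology (child -> parents).
PARENTS = {
    "mitotic cell cycle": ["cell cycle"],
    "cell cycle": ["biological process"],
    "biological process": [],
}


def implied_terms(direct_term, parents):
    """The direct annotation plus every parent term it implies."""
    return {direct_term} | ancestors(direct_term, parents)
```

A gene annotated directly to "mitotic cell cycle" is therefore also found by a query for "cell cycle", even though the curator never made that annotation explicitly.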
True Path Rule
• Every path from any term back to its top-level parent(s) must always be true (biologically accurate), or the ontology must be revised
[Figure: example paths through the terms cell, cytoplasm, nucleus, chromosome, nuclear chromosome, cytoplasmic chromosome and mitochondrial chromosome, linked by is-a and part-of relationships]
Anatomy of a GO term
id: GO:0006094 (unique GO ID)
name: gluconeogenesis (term name)
namespace: process (ontology)
def: The formation of glucose from noncarbohydrate precursors, such as pyruvate, amino acids and glycerol. (definition; def source: http://cancerweb.ncl.ac.uk/)
exact_synonym: glucose biosynthesis (synonym)
is_a: GO:0006006 (parentage)
is_a: GO:0006092
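Because a term is stored as simple "tag: value" lines, it can be read with very little code. The sketch below assumes exactly that line layout and is not a complete parser for the ontology file format; the definition line is left out for brevity, and `parse_stanza` is a hypothetical helper name.

```python
STANZA = """\
id: GO:0006094
name: gluconeogenesis
namespace: process
exact_synonym: glucose biosynthesis
is_a: GO:0006006
is_a: GO:0006092
"""


def parse_stanza(text):
    """Collect 'tag: value' lines; every tag maps to a list of values,
    so repeated tags such as is_a keep all of their entries."""
    term = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first ": " only, so the colon inside a GO ID is kept.
        tag, _, value = line.partition(": ")
        term.setdefault(tag, []).append(value.strip())
    return term


term = parse_stanza(STANZA)
```

The two `is_a` lines end up as a two-element list, which is the term's parentage in the DAG.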
No GO Areas
• GO covers ‘normal’ functions and processes
– No pathological processes
– No experimental conditions
• NO evolutionary relationships
• NO gene products
• NOT a system of nomenclature for genes
Things to remember
• A gene product may have several functions, processes or components
• Sets of functions make up a biological process
• A function term refers to a reaction or activity, NOT a gene product
• …..
So how does the GO annotation happen?
For each gene:
1. Read and record paper (PMID: *****)
2. Identify GO terms (GO:****)
3. What type of evidence? (e.g. IDA)
4. Record the annotation: ******* GO:******* IDA PMID:*******
5. Submit to the GO Consortium
6. Annotation appears in GO database
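The record a curator produces (gene product, GO term, evidence code, reference) could be modelled as follows. This is a hedged sketch: the class, field names and placeholder values are assumptions for illustration, not the GO Consortium's submission format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    gene_product: str  # identifier of the annotated gene product
    go_id: str         # GO term identifier, e.g. "GO:0006094"
    evidence: str      # evidence code, e.g. "IDA"
    reference: str     # literature source, written as "PMID:<number>"


# Evidence codes that come from experiments in the primary literature.
EXPERIMENTAL = {"IDA", "IPI", "IGI", "IMP"}


def is_experimental(annotation):
    """True when the annotation rests on direct experimental evidence."""
    return annotation.evidence in EXPERIMENTAL


# Placeholder values only: neither the gene nor the reference is real.
example = Annotation("geneA", "GO:0006094", "IDA", "PMID:PLACEHOLDER")
```

Keeping the evidence code and reference on every record is what later allows users to filter annotations by how they were made.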
Who uses GO?
Many groups annotate, so we see the results of research across species
GO:0019789 SUMO ligase activity
GeneDB S. pombe: pli1, nse2
SGD: CST9, MMS21, SIZ1, NFI1
RGD: Pias4, Miz1, Pias3
MGI: Pias4, Pias3, Pias2
TAIR: ATSIZ1
Fission yeast GO annotation status
[Figure labels: Cellular Component, Molecular Function, Biological Process; counts shown: 7519, 13494, 9459. Illustration: citrate synthase (Acetyl-CoA, CoA-SH) in the TCA cycle]
Total 30,616 annotations to 3080 terms
Data from 06/06/07
Fission yeast annotation progress
Evidence Codes used
 8618  IDA  inferred from direct assay
  776  IPI  inferred from physical interaction
  901  IGI  inferred from genetic interaction
 1089  TAS  traceable author statement
 1073  IC   inferred by curator
 9045  ISS  inferred from sequence similarity
 1912  IMP  inferred from mutant phenotype
  522  NAS  non-traceable author statement
 6397  IEA  inferred from electronic annotation
30333  Total
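The total can be checked directly from the per-code counts. The snippet below simply re-adds the figures from the slide; the variable names are arbitrary.

```python
# Annotation counts per evidence code, as listed on the slide.
EVIDENCE_COUNTS = {
    "IDA": 8618, "IPI": 776, "IGI": 901, "TAS": 1089, "IC": 1073,
    "ISS": 9045, "IMP": 1912, "NAS": 522, "IEA": 6397,
}

total = sum(EVIDENCE_COUNTS.values())

# Annotations backed by experiments in the primary literature.
experimental = sum(EVIDENCE_COUNTS[code] for code in ("IDA", "IPI", "IGI", "IMP"))

# The evidence code with the most annotations.
most_used = max(EVIDENCE_COUNTS, key=EVIDENCE_COUNTS.get)
```

The sum reproduces the 30333 shown, with ISS the single largest category.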
GO Curation Strategy
Manual Curation
• Emphasis on primary literature (IDA, IMP, IGI, IPI)
• Manual inspection of sequence similarity (ISS)
Computational Mappings (IEA)
• InterPro (domain or family) to GO
• UniProt (Swiss-Prot keyword to GO)
• E.C. number to GO
Figure labels: 1617 PMIDs, 15230 annotations; 5815 annotations; 9569 annotations
Data from 06/06/07
[Chart: number of GO associations (y axis, 0 to 35000) by date (x axis, Jan-04 to May-07) for four series: pombe manual, pombe electronic, pombe total, cerevisiae total. Data labels: 14364, 16682, 19008, 20108, 22530, 30343, 30616]
Total 30,616 annotations to 3080 GO terms. S. cerevisiae has 27662 annotations to 2971 GO terms (no IEA)
Data from 06/06/07
GO Curation Progress
GO aspect coverage
Function: 3542 (includes protein binding)
Biological Process: 4019
Cellular Component: 4821
All three aspects unknown: 105 (564 S. cerevisiae)
Total: 5004 (5780 S. cerevisiae)
[Diagram labels: 14672679, 3279(3455), 191, 54, 18, 993]
Developing GO
• GO under constant development
• International group of developers (all the major model organism databases contribute)
– central editorial office at EBI - 4 members
• Developed in consultation with domain experts
– Term suggestions handled through online tracking system
Adding new terms and biological concepts to the Gene Ontology
Why GO changes
• Advances in biology
• New organisms join, need new terms
• Fix errors and legacy terms
• Improve logical consistency
• Suggestions for changes come from
– the GO editors and organism curators
– the user community
– analysis of logical consistency
[Figure labels: FlyBase, SGD, MGI]
• Provides a standard for annotation
• Has 2 components: the ontology and the annotations
• Allows experimental work to be evaluated in the context of other experimental data, which may be annotated at different levels of granularity
• Allows biologists to search and analyse data (particularly for identifying groups of overrepresented genes in large-scale experiments)
• Becomes increasingly powerful as the ontologies and annotations are refined
• More here?
Applications
What can scientists do with GO?
• Access gene product functional information (GeneDB, AmiGO)
• Do cross-species comparison (AmiGO)
• “Slimming”: grouping data into broad (user-defined) categories to provide an overview of a gene set or genome
• “Term Enrichment” (or depletion): provide a link between biological knowledge and functional genomics data
• Data mining
– Simple: all genes annotated to a term (GeneDB, AmiGO)
– Complex: using Boolean operators (union/intersect) (GeneDB)
Slimming
• High level view of GO (genes annotated to granular terms are mapped to higher level terms)
• Allows users to group genes into broader categories to assess their distribution, useful for large scale, genome wide analyses or smaller gene sets
• Different annotation groups have created specific GO slims, which are available at GO’s FTP site
• You can create and use your own GO slim with high level terms of interest
• CARE: the counts across slim categories do not add up to a gene product count, as gene products have multiple annotations
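Slimming can be sketched as mapping each granular term to whichever of its ancestors appear in the slim. The ontology fragment, gene names and function names below are hypothetical and chosen only to show the mechanics, including the double-counting caveat.

```python
def ancestors_and_self(term, parents):
    """The term itself plus every term reachable through parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def slim_buckets(annotations, parents, slim_terms):
    """annotations: gene -> set of granular terms.
    Returns slim term -> set of genes that map up to it."""
    slim = set(slim_terms)
    buckets = {term: set() for term in slim_terms}
    for gene, terms in annotations.items():
        for term in terms:
            for slim_term in ancestors_and_self(term, parents) & slim:
                buckets[slim_term].add(gene)
    return buckets


# Hypothetical ontology fragment (child -> parents) and annotations.
PARENTS = {"mRNA export": ["transport"], "mitotic cell cycle": ["cell cycle"]}
ANNOTATIONS = {
    "gene1": {"mRNA export", "mitotic cell cycle"},
    "gene2": {"mRNA export"},
}
buckets = slim_buckets(ANNOTATIONS, PARENTS, ["transport", "cell cycle"])
```

Summing the bucket sizes gives 3 even though only 2 genes were annotated, because gene1 falls into both categories.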
Term Enrichment
What is it?
• Finds significant GO terms shared among a list of genes
• Discovers what these genes may have in common
• Statistical measure of how likely it is that your differentially regulated genes fall into that category by chance
Example: microarray experiment, 1000 genes
100 genes differentially regulated
mitosis – 80/100
apoptosis – 40/100
p. ctrl. cell prol. – 30/100
glucose transp. – 20/100
How?
GO Term mapper/Onto express
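One common way to compute such a statistic is a hypergeometric test; whether the tools named above use exactly this test is not stated here, so treat the following as a sketch of the general idea. The 1000-gene experiment, the 100-gene list and the 80 mitosis genes come from the example; the background count of 200 mitosis genes is an assumed value for illustration.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """Upper-tail hypergeometric probability P(X >= k).

    N: genes in the experiment, K: genes among them annotated to the term,
    n: genes in the list of interest, k: list genes annotated to the term.
    """
    upper = min(K, n)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / comb(N, n)


# K=200 is an assumption: the slide does not give the background count.
p_mitosis = enrichment_p_value(k=80, n=100, K=200, N=1000)
```

With 20 mitosis genes expected by chance in a 100-gene list under these assumptions, observing 80 gives a vanishingly small p-value, so the term would be reported as enriched.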
Data mining, complex
                               A     B     C     D     E     F     G     H     I     J
A cell division             1018   356   224    31    49     2   271   132     -     -
B transcription>translat.         1367    53    66   172     0   111    47     -     -
C cytoskeletal/morph/vmt                 842   152    32    30    78   160     -     -
D metabolic pathways                           800   196    61    36    52     -     -
E mitochondrial function                             732    98    47    14     -     -
F membrane transport                                       299     6     2     -     -
G stress                                                         422    65     -     -
H signal transduction                                                  369     -     -
I other                                                                      323     -
J none                                                                             988
What: You can data mine the entire genome to find overlaps and intersections between terms of interest to target genes for further study
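Boolean querying amounts to set operations on the genes annotated to each term. The gene sets below are hypothetical, and the sketch assumes that the diagonal of an overlap table holds each category's own total.

```python
# Hypothetical gene sets keyed by category name.
GENES_BY_TERM = {
    "cell division": {"gene1", "gene7", "gene8"},
    "stress": {"gene7", "gene9"},
    "signal transduction": {"gene8", "gene9", "gene10"},
}

# Intersect: genes annotated to both terms.
both = GENES_BY_TERM["cell division"] & GENES_BY_TERM["stress"]
# Union: genes annotated to either term.
either = GENES_BY_TERM["cell division"] | GENES_BY_TERM["stress"]
# Difference: cell division genes with no stress annotation.
only_division = GENES_BY_TERM["cell division"] - GENES_BY_TERM["stress"]


def overlap_matrix(genes_by_term):
    """Pairwise intersection sizes for the upper triangle of a table."""
    terms = sorted(genes_by_term)
    return {
        (a, b): len(genes_by_term[a] & genes_by_term[b])
        for i, a in enumerate(terms)
        for b in terms[i:]
    }


matrix = overlap_matrix(GENES_BY_TERM)
```

Genes in an unexpected intersection, such as the single gene shared by "cell division" and "stress" here, are natural candidates for further study.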
Tools
• AmiGO
• GO Slim Mapper: maps the specific, granular GO terms used to annotate a list of gene products to corresponding more general parent GO slim terms.
• GO Term Finder: searches for significant shared GO terms, or parents of the GO terms, used to annotate gene products in a given list.
• GeneDB Boolean querying
Acknowledgements
Additional points
• A gene product can have several functions, cellular locations and be involved in many processes
• Annotation of a gene product to one ontology is independent from its annotation to other ontologies
• Annotations are only to terms reflecting a normal activity or location
• Usage of ‘unknown’ GO terms
Modifying the interpretation of an annotation: the Qualifier column
1. NOT
• a gene product is NOT associated with the GO term
• to document conflicting claims in the literature
2. Contributes to
• distinguishes between individual subunit functions and whole complex functions
• used with GO Function Ontology
3. Colocalizes with
• transiently or peripherally associated with an organelle or complex
• used with GO Component Ontology
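In practice the qualifier has to be checked before an annotation is counted, since NOT reverses its meaning. The sketch below uses hypothetical records with placeholder gene names and GO IDs; the spelling of the qualifier strings is an assumption.

```python
def asserts_term(annotation):
    """False when the NOT qualifier negates the annotation."""
    return "NOT" not in annotation.get("qualifiers", ())


# Hypothetical records; gene names and GO IDs are placeholders.
records = [
    {"gene": "geneA", "go_id": "GO:XXXXXXX", "qualifiers": ()},
    {"gene": "geneB", "go_id": "GO:XXXXXXX", "qualifiers": ("NOT",)},
    {"gene": "geneC", "go_id": "GO:YYYYYYY", "qualifiers": ("contributes_to",)},
]

# Genes that are positively associated with their term.
positive = [record["gene"] for record in records if asserts_term(record)]
```

An analysis that ignores the column would wrongly count geneB as having the activity it was explicitly shown not to have.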
Electronic Annotations
Fatty acid biosynthesis (Swiss-Prot Keyword) → GO:Fatty acid biosynthesis (GO:0006633)
EC:6.4.1.2 (EC number) → GO:acetyl-CoA carboxylase activity (GO:0003989)
IPR000438: Acetyl-CoA carboxylase carboxyl transferase beta subunit (InterPro entry) → GO:acetyl-CoA carboxylase activity (GO:0003989)
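An electronic mapping is essentially a lookup from external identifiers to GO terms. The three entries below are the ones shown on the slide; the dictionary layout, the key prefixes and the function name are hypothetical.

```python
# External identifier -> GO ID (entries from the slide, layout assumed).
EXTERNAL_TO_GO = {
    "SP_KW:Fatty acid biosynthesis": "GO:0006633",
    "EC:6.4.1.2": "GO:0003989",
    "InterPro:IPR000438": "GO:0003989",
}


def electronic_annotations(external_ids):
    """GO IDs implied by a protein's external identifiers, each tagged
    with the IEA evidence code. Duplicate GO IDs are collapsed."""
    go_ids = {
        EXTERNAL_TO_GO[ext_id]
        for ext_id in external_ids
        if ext_id in EXTERNAL_TO_GO
    }
    return sorted((go_id, "IEA") for go_id in go_ids)
```

A protein carrying both the EC number and the InterPro entry receives the acetyl-CoA carboxylase activity term once, not twice.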
Unknown vs. Unannotated
• “Unknown” is used when the curator has determined that there is no existing literature to support an annotation.
– Biological process unknown GO:0000004
– Molecular function unknown GO:0005554
– Cellular component unknown GO:0008372
• NOT the same as having no annotation at all
– No annotation means that no one has looked yet
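The distinction matters when counting how much of a genome is characterised. A minimal sketch, using the three "unknown" identifiers listed above and a hypothetical `annotation_status` helper that looks at one aspect of one gene product at a time:

```python
UNKNOWN_TERMS = {
    "GO:0000004",  # biological process unknown
    "GO:0005554",  # molecular function unknown
    "GO:0008372",  # cellular component unknown
}


def annotation_status(go_ids):
    """Classify the annotations a gene product has for one aspect."""
    if not go_ids:
        return "unannotated"  # no one has looked yet
    if set(go_ids) <= UNKNOWN_TERMS:
        return "unknown"      # curated, but nothing is known
    return "annotated"
```

An empty list and a list holding only an "unknown" term are therefore reported differently, even though neither tells you what the gene product does.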