the gene ontology project and its application to fission yeast functional genomics data

Post on 30-Dec-2015

The Gene Ontology project and its application to fission yeast functional genomics data

Valerie Wood

• Introduction to the Gene Ontology (GO) project
• Data mining the fission yeast genome data
• What is GO? (requirement, implementation)
• How does it work? (annotation and ontology development)
• What can I use it for? (applications)
• Tools for using GO for data analysis
• How can I use it? Practical exercises

Gene Ontology: Why?

Traditional analysis

Gene 1: mRNA export, protein phosphorylation, transcription, mitotic cell cycle…
Gene 2: mRNA export, DNA recombination, RNA elongation (pol II)…

• requires literature searching
• time-consuming
• gene by gene basis

Not scalable!

Gene 3: mRNA export, transcription (pol II)…
Gene 4: mRNA export, transcription polyadenylation…
Gene 5: mRNA export, RNA elongation…
Gene 6: mRNA export, rRNA transcription, DNA topological change…
Gene 5000: cell cycle, chromosome segregation, kinetochore assembly, protein localization…

http://www.teamtechnology.co.uk/f-scientist.jpg

Help! The problem gets bigger and bigger and bigger!

The literature corpus

What is the size of the ‘annotation problem’?

• Fission yeast + pombe gives 8170 results
• Including cell cycle gives 3467
• Including DNA repair gives 555

How will we ever extract all of this information?

Grouping by process

mRNA export: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5
transcription: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5…
protein phosphorylation: Gene 1, Gene 7, Gene 10…
cell wall organization and biogenesis: Gene 10, Gene 15, Gene 18…
cell cycle: Gene 1, Gene 7, Gene 8…
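Grouping by process amounts to inverting the gene-to-process lists into process-to-gene lists. A minimal Python sketch, using the first three genes from the earlier slide; the function and variable names are illustrative assumptions, not part of any GO tool:

```python
from collections import defaultdict

# Per-gene annotations, copied from the "Traditional analysis" slide.
gene_to_processes = {
    "Gene 1": ["mRNA export", "protein phosphorylation",
               "transcription", "mitotic cell cycle"],
    "Gene 2": ["mRNA export", "DNA recombination", "RNA elongation (pol II)"],
    "Gene 3": ["mRNA export", "transcription (pol II)"],
}


def group_by_process(annotations):
    """Invert a gene -> processes mapping into process -> sorted list of genes."""
    grouped = defaultdict(set)
    for gene, processes in annotations.items():
        for process in processes:
            grouped[process].add(gene)
    return {process: sorted(genes) for process, genes in grouped.items()}


process_to_genes = group_by_process(gene_to_processes)
print(process_to_genes["mRNA export"])  # ['Gene 1', 'Gene 2', 'Gene 3']
```

Note that "transcription" and "transcription (pol II)" end up in separate groups here, which is exactly the free-text problem a controlled vocabulary is meant to solve.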

[Figure: clustered gene expression tree (attacked vs. control, over time; gene list: all genes (14010)), with clusters labelled by GO terms:
• Puparial adhesion, Molting cycle, hemocyanin
• Defense response, Immune response, Response to stimulus, Toll regulated genes, JAK-STAT regulated genes
• Immune response, Toll regulated genes
• Amino acid catabolism, Lipid metabolism
• Peptidase activity, Protein catabolism, Immune response]

Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.

GO can be used to spot patterns in the thousands of genes typically produced by functional genomics experiments

A controlled vocabulary

GO is also necessary for handling different terminology used between and within scientific communities:

• Different phrases have the same or related meanings
  late endosome to vacuole transport, multivesicular body sorting, MVB sorting
  → late endosome to vacuole transport ; GO:0045324

• The same phrase is used to describe different ‘entities’
  Bud initiation? tooth bud initiation, cellular bud initiation, flower bud initiation
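The synonym case can be sketched as a lookup table that maps every known phrase to one term ID. The table below is a hypothetical illustration built only from the phrases on this slide; it is not how GO itself stores synonyms:

```python
# Hypothetical synonym table: free-text phrase -> (GO ID, preferred term name).
# Only the ID and phrases shown on the slide are used.
PREFERRED = ("GO:0045324", "late endosome to vacuole transport")
SYNONYMS = {
    "late endosome to vacuole transport": PREFERRED,
    "multivesicular body sorting": PREFERRED,
    "mvb sorting": PREFERRED,
}


def normalise(phrase):
    """Return (GO ID, term name) for a known phrase, or None if unknown."""
    key = " ".join(phrase.lower().split())  # ignore case and extra spaces
    return SYNONYMS.get(key)


print(normalise("MVB sorting"))
print(normalise("bud initiation"))  # None: ambiguous phrase, needs a curator
```

An ambiguous phrase such as "bud initiation" is deliberately left out of the table, since it could refer to a tooth bud, a cellular bud or a flower bud.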

http://www.geneontology.org/

So what is GO?

• GO provides a “controlled vocabulary” for biological knowledge that can be interpreted identically both within and between genomes
• Species independent, therefore enabling cross species comparisons
• Provides a way to capture and represent biological knowledge in a computable form

Gene Ontology Content and structure

What is Ontology?

• Dictionary: A branch of metaphysics concerned with the nature and relations of being.

• In philosophy, the most fundamental branch of metaphysics. It studies being or existence as well as the basic categories thereof—trying to find out what entities and what types of entities exist. – Wikipedia

1606 1700s

So what does that mean?

From a practical view, an ontology is a representation of something we know about. “Ontologies” consist of a representation of things that are detectable or directly observable, and the relationships between those things.

Ontologies provide controlled, consistent vocabularies to describe concepts and relationships, thereby enabling knowledge sharing (Gruber 1993).

Ontology

Includes:
1. A vocabulary of terms (names for concepts)
2. Definitions
3. Defined logical relationships to each other

What information might we want to capture about a gene product?

• What does the gene product do?
• Where and when does it act?
• Why does it perform these activities?

• GO divided into three parts:
  molecular function
  cellular component
  biological process

Cellular Component

• where a gene product acts (location or complex)

Images from http://microscopy.fsu.edu

Molecular Function

• What a gene product does (activity)

glucose-6-phosphate isomerase activity

insulin binding

insulin receptor activity

drug transporter activity

Biological Process

Broad objective or goal

cell division

transcription

gluconeogenesis

Analogy: Gene Product = hammer

Function (what)            Process (why)
Drive nail (into wood)     Carpentry
Drive stake (into soil)    Gardening
Smash roach                Pest Control

Ontology Structure

• The Gene Ontology is structured as a directed acyclic graph (DAG)

• A DAG is similar to a hierarchy except terms can have more than one parent

• Terms can have zero, one or more children

• Terms are linked by two relationships
  – is-a
  – part-of

Parent-Child Relationships

Hierarchy: one-to-many parental relationship. Each child has only one parent.

DAG (Directed Acyclic Graph): many-to-many parental relationship. Each child may have one or more parents.

Ontology Structure

[Figure: example graph with the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane, linked by is-a and part-of relationships]

Ontology structure

• This allows the modelling of biology more realistically than a hierarchy
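The difference between a hierarchy and a DAG can be checked mechanically. The sketch below uses the terms from the example figure; the specific edges and relationship types are assumptions made for illustration, since the figure itself did not survive extraction:

```python
# child -> list of (relationship, parent). Edges are assumed for illustration.
PARENTS = {
    "cell": [],
    "membrane": [("part_of", "cell")],
    "chloroplast": [("part_of", "cell")],
    "mitochondrial membrane": [("is_a", "membrane")],
    "chloroplast membrane": [("is_a", "membrane"), ("part_of", "chloroplast")],
}


def is_hierarchy(parents):
    """True if every term has at most one parent (a strict tree)."""
    return all(len(edges) <= 1 for edges in parents.values())


def is_acyclic(parents):
    """Depth-first search; meeting a term again on the current path is a cycle."""
    done = set()

    def visit(term, path):
        if term in path:
            return False
        if term in done:
            return True
        ok = all(visit(parent, path | {term})
                 for _, parent in parents.get(term, []))
        done.add(term)
        return ok

    return all(visit(term, frozenset()) for term in parents)


print(is_hierarchy(PARENTS))  # False: "chloroplast membrane" has two parents
print(is_acyclic(PARENTS))    # True: still a valid DAG
```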

Ontology structure

An important feature of GO is that broader parents give rise to more specific children. When a gene is annotated to a term, it is automatically annotated to all of its parent terms.

This allows curators to assign terms at different levels of granularity, depending on what is known or can be inferred.
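Automatic annotation to parent terms is a transitive closure over the graph. A minimal sketch, assuming a simplified set of edges between terms (real GO has intermediate terms and typed relationships that are omitted here):

```python
# child -> parents. Simplified, assumed edges for illustration only.
PARENTS = {
    "nuclear chromosome": ["chromosome", "nucleus"],
    "chromosome": ["cell"],
    "nucleus": ["cell"],
    "cell": [],
}


def ancestors(term, parents):
    """All terms reachable by following parent links upwards from `term`."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def propagate(direct, parents):
    """Extend each gene's direct annotations with every ancestor term."""
    return {
        gene: set(terms) | {a for t in terms for a in ancestors(t, parents)}
        for gene, terms in direct.items()
    }


full = propagate({"gene A": {"nuclear chromosome"}}, PARENTS)
print(sorted(full["gene A"]))
```

A query for genes annotated to "nucleus" would therefore return gene A even though the curator only recorded the more specific term.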

gene A

True Path Rule

• Every path from any term back to its top-level parent(s) must always be true (biologically accurate), or the ontology must be revised

[Figure: example with the terms cell, cytoplasm, nucleus, chromosome, nuclear chromosome, cytoplasmic chromosome and mitochondrial chromosome, linked by is-a and part-of relationships]

Anatomy of a GO term

id: GO:0006094    (unique GO ID)
name: gluconeogenesis    (term name)
namespace: process    (ontology)
def: The formation of glucose from noncarbohydrate precursors, such as pyruvate, amino acids and glycerol.    (definition; def source: http://cancerweb.ncl.ac.uk/)
exact_synonym: glucose biosynthesis    (synonym)
is_a: GO:0006006    (parentage)
is_a: GO:0006092    (parentage)
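Because each line of a term record is a `tag: value` pair, a record like the one above is easy to read programmatically. This is a simplified sketch for the fields shown on the slide, not a full parser for the GO flat file format:

```python
STANZA = """\
id: GO:0006094
name: gluconeogenesis
namespace: process
def: The formation of glucose from noncarbohydrate precursors, such as pyruvate, amino acids and glycerol.
exact_synonym: glucose biosynthesis
is_a: GO:0006006
is_a: GO:0006092
"""


def parse_stanza(text):
    """Parse 'tag: value' lines into a dict of tag -> list of values."""
    term = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first ": " only, so "GO:0006094" stays intact.
        tag, _, value = line.partition(": ")
        term.setdefault(tag, []).append(value.strip())
    return term


term = parse_stanza(STANZA)
print(term["id"][0], term["name"][0])  # GO:0006094 gluconeogenesis
print(term["is_a"])                    # ['GO:0006006', 'GO:0006092']
```

Values are kept as lists because tags such as `is_a` can repeat, which is how a term records more than one parent.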

No GO Areas

• GO covers ‘normal’ functions and processes
  – No pathological processes
  – No experimental conditions
• NO evolutionary relationships
• A function term refers to a reaction or activity, NOT a gene product
• NOT a system of nomenclature for genes

So how does GO annotation happen?

For each gene:
• Read and record paper (PMID: *****)
• Identify GO terms (GO:****)
• What type of evidence? (IDA)
• Resulting annotation: ******* GO:******* IDA PMID:*******
• Submit to the GO Consortium
• Annotation appears in GO database
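The outcome of these steps is a record linking a gene product, a GO term, an evidence code and a reference. The sketch below models that record with four fields only; the real annotation file format has more columns, and the gene name and identifiers used here are placeholders, not real entries:

```python
from dataclasses import dataclass

# Evidence codes drawn from primary experimental literature.
EXPERIMENTAL = {"IDA", "IPI", "IGI", "IMP"}


@dataclass(frozen=True)
class Annotation:
    gene: str        # gene product name
    go_id: str       # GO term identifier
    evidence: str    # evidence code, e.g. IDA
    reference: str   # source, e.g. a PubMed ID

    def is_experimental(self):
        return self.evidence in EXPERIMENTAL

    def as_row(self):
        """Tab-separated line, in the order shown on the slide."""
        return "\t".join([self.gene, self.go_id, self.evidence, self.reference])


# Placeholder values only: "geneX", the GO ID and the PMID are not real.
example = Annotation("geneX", "GO:0000000", "IDA", "PMID:0000000")
print(example.as_row())
print(example.is_experimental())  # True
```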

Who uses GO?

Many groups annotate, so we see the results of research across species

GO:0019789 SUMO ligase activity

GeneDB S. pombe: pli1, nse2
SGD: CST9, MMS21, SIZ1, NFI1
RGD: Pias4, Miz1, Pias3
MGI: Pias4, Pias3, Pias2
TAIR: ATSIZ1

Fission yeast GO annotation status

[Figure: annotation counts for the three aspects (Cellular Component, Molecular Function, Biological Process), illustrated with citrate synthase (Acetyl-CoA, CoA-SH) and the TCA cycle. Counts shown: 7519, 13494, 9459]

Total 30,616 annotations to 3080 terms

Data from 06/06/07

Fission yeast annotation progress

Evidence Codes used

 8618  IDA  inferred from direct assay
  776  IPI  inferred from physical interaction
  901  IGI  inferred from genetic interaction
 1089  TAS  traceable author statement
 1073  IC   inferred by curator
 9045  ISS  inferred from sequence similarity
 1912  IMP  inferred from mutant phenotype
  522  NAS  non-traceable author statement
 6397  IEA  inferred from electronic annotation

30333  Total
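The per-code counts can be checked against the total and split into manual and electronic annotations with a few lines of arithmetic. The counts are taken from the slide; the variable names are illustrative:

```python
# Annotation counts per evidence code, as listed on the slide.
EVIDENCE_COUNTS = {
    "IDA": 8618, "IPI": 776, "IGI": 901, "TAS": 1089, "IC": 1073,
    "ISS": 9045, "IMP": 1912, "NAS": 522, "IEA": 6397,
}

total = sum(EVIDENCE_COUNTS.values())
electronic = EVIDENCE_COUNTS["IEA"]   # computational mappings
manual = total - electronic           # everything a curator assigned

print(total)                                # 30333
print(manual)                               # 23936
print(round(100 * electronic / total, 1))   # 21.1 (percent electronic)
```

The sum matches the 30333 shown on the slide, so roughly one annotation in five was electronic at the time of this snapshot.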

GO Curation Strategy

Manual Curation
• Emphasis on Primary Literature (IDA, IMP, IGI, IPI)
• Manual inspection of sequence similarity (ISS)

Computational Mappings (IEA)
• InterPro (domain or family) to GO
• UniProt (Swissprot keyword to GO)
• E.C. number to GO

1617 PMIDs, 15230 annotations

5815 annotations

9569 annotations

Data from 06/06/07

[Figure: number of associations over time, Jan-04 to May-07 (x-axis: date; y-axis: associations, 0 to 35000). Series: pombe manual, pombe electronic, pombe total, cerevisiae total. Data labels: 14364, 16682, 19008, 20108, 22530, 30343, 30616]

Total 30,616 annotations to 3080 GO terms. S. cerevisiae has 27662 annotations to 2971 GO terms (no IEA).

Data from 06/06/07

GO Curation Progress

GO aspect coverage

[Figure: Venn diagram of gene products annotated in each GO aspect]

Function: 3542 (includes protein binding)
Biological Process: 4019
Cellular Component: 4821
All three aspects unknown: 105 (564 S. cerevisiae)
Total: 5004 (5780 S. cerevisiae)

Region counts as extracted from the diagram: 14672679, 3279 (3455), 191, 54, 18, 993

• A gene product can have several functions, cellular locations and be involved in many processes

• Groups of functions make up a biological process

• Annotation of a gene product to one ontology is independent from its annotation to other ontologies

• Genes with ‘no data’ are annotated to the ‘root node’

Developing GO

• GO under constant development
• International group of developers (all the major model organism databases contribute)
  – central editorial office at EBI - 4 members
• Developed in consultation with domain experts
  – Term suggestions handled through online tracking system

Adding new terms and biological concepts to the Gene Ontology

Why GO changes

• Advances in biology
• New organisms join, need new terms
• Fix errors and legacy terms
• Improve logical consistency

• Suggestions for changes come from
  – the GO editors and organism curators
  – the user community
  – analysis of logical consistency

flybase

SGD

SGD

MGI

Applications

• Provides a standard for annotation
• Has 2 components: the ontology and the annotations
• Allows experimental work to be evaluated in the context of other experimental data which may be annotated at different levels of granularity
• Allows biologists to search and analyse data (particularly for identifying groups of overrepresented genes in large scale experiments)
• Becomes increasingly powerful as the ontologies and annotations are refined
• More here?

What can scientists do with GO?

• Access gene product functional information (GeneDB, AmiGO)
• Do cross species comparison (AmiGO)
• “Slimming”: grouping data into broad (user defined) categories to provide an overview of a geneset or genome
• “Term Enrichment” (or depletion): provide a link between biological knowledge and functional genomics data
• Data mining
  – Simple: all genes annotated to a term (GeneDB, AmiGO)
  – Complex: using Boolean operators (union/intersect) (GeneDB)

Slimming

• High level view of GO (genes annotated to granular terms are mapped to higher level terms)

• Allows users to group genes into broader categories to assess their distribution, useful for large scale, genome wide analyses or smaller gene sets

• Different annotation groups have created specific GO slims, which are available at GO’s FTP site

• You can create and use your own GO slim with high level terms of interest

• CARE: not a gene product count, as gene products have multiple annotations
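Slimming can be sketched as walking each granular term up to whichever slim terms it falls under, then counting genes per slim term. The graph edges, slim set and gene annotations below are small assumed examples, not an official GO slim:

```python
# child -> parents. Simplified, assumed edges for illustration only.
PARENTS = {
    "mitotic cell cycle": ["cell cycle"],
    "cell cycle": [],
    "mRNA export": ["transport"],
    "transport": [],
}
SLIM = {"cell cycle", "transport"}  # hypothetical user-defined slim


def ancestors_and_self(term, parents):
    """The term itself plus every term reachable through parent links."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def slim_counts(gene_terms, parents, slim):
    """Number of distinct genes mapped to each slim term."""
    genes_per_slim = {s: set() for s in slim}
    for gene, terms in gene_terms.items():
        for term in terms:
            for s in ancestors_and_self(term, parents) & slim:
                genes_per_slim[s].add(gene)
    return {s: len(genes) for s, genes in genes_per_slim.items()}


gene_terms = {
    "Gene 1": ["mitotic cell cycle", "mRNA export"],
    "Gene 2": ["mRNA export"],
}
counts = slim_counts(gene_terms, PARENTS, SLIM)
print(counts["cell cycle"], counts["transport"])  # 1 2
```

The counts add up to 3 although there are only 2 genes, which illustrates the CARE point: a gene with several annotations is counted once in each category it maps to.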

Term Enrichment

What is it?
• finds significant GO terms shared among a list of genes
• discover what these genes may have in common
• statistical measure of how likely it is that your differentially regulated genes fall into that category by chance

Example: microarray experiment with 1000 genes, of which 100 genes are differentially regulated

mitosis – 80/100
apoptosis – 40/100
p. ctrl. cell prol. – 30/100
glucose transp. – 20/100

How?

GO Term mapper/Onto express
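One common way to compute such a measure is a hypergeometric test: given how many genes in the whole experiment carry a term, how likely is it to see at least the observed number in the regulated list by chance? The sketch below is one such test and is not necessarily what the tools named above implement. The slide does not give the number of mitosis genes among all 1000, so the value 100 used here is an assumption:

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the term."""
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)


# 1000 genes on the array, 100 differentially regulated, 80 of those
# annotated to mitosis (from the slide). ASSUMPTION: 100 of the 1000
# genes are annotated to mitosis overall.
p = enrichment_p_value(population=1000, annotated=100, sample=100, hits=80)
print(p < 0.05)  # True: far more mitosis genes than expected by chance
```

Under the assumed background, only about 10 mitosis genes would be expected in a random list of 100, so 80 is a strong enrichment. In practice a correction for testing many terms at once is also needed.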

Data mining, complex

                               A     B     C     D     E     F     G     H     I     J
A cell division             1018   356   224    31    49     2   271   132     -     -
B transcription>translat.         1367    53    66   172     0   111    47     -     -
C cytoskeletal/morph/vmt                 842   152    32    30    78   160     -     -
D metabolic pathways                           800   196    61    36    52     -     -
E mitochondrial function                             732    98    47    14     -     -
F membrane transport                                       299     6     2     -     -
G stress                                                         422    65     -     -
H signal transduction                                                  369     -     -
I other                                                                      323     -
J none                                                                             988

What: You can data mine the entire genome to find overlaps and intersections between terms of interest to target genes for further study
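Boolean queries of this kind map directly onto set operations. The gene sets below are hypothetical placeholders, and the helper name is an assumption rather than part of GeneDB:

```python
# Hypothetical sets of genes annotated to each category.
ANNOTATED = {
    "cell division": {"gene1", "gene2", "gene3"},
    "stress": {"gene2", "gene4"},
    "signal transduction": {"gene3", "gene4", "gene5"},
}

both = ANNOTATED["cell division"] & ANNOTATED["stress"]    # intersect (AND)
either = ANNOTATED["cell division"] | ANNOTATED["stress"]  # union (OR)
only_division = ANNOTATED["cell division"] - ANNOTATED["stress"]  # AND NOT


def overlap_matrix(sets):
    """Upper-triangular table of pairwise overlap sizes; the diagonal holds
    the size of each set."""
    names = sorted(sets)
    return {
        (a, b): len(sets[a] & sets[b])
        for i, a in enumerate(names)
        for b in names[i:]
    }


matrix = overlap_matrix(ANNOTATED)
print(sorted(both))                         # ['gene2']
print(matrix[("cell division", "stress")])  # 1
```

Applied to real annotation sets, `overlap_matrix` produces a table with the same upper-triangular layout as the one above.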

Tools

• AmiGO
• GO Slim Mapper: maps the specific, granular GO terms used to annotate a list of gene products to corresponding more general parent GO slim terms.
• GO Term Finder: searches for significant shared GO terms, or parents of the GO terms, used to annotate gene products in a given list.
• GeneDB Boolean querying

Acknowledgements

• Martin Aslett (database support)
• Adrian Tivey (GeneDB programmer)
• Midori Harris and the GO editorial office
• Pfam group
• SGD curators

Additional points

Modifying the interpretation of an annotation: the Qualifier column

1. NOT
• a gene product is NOT associated with the GO term
• to document conflicting claims in the literature.

2. Contributes to
• distinguishes between individual subunit functions and whole complex functions
• used with GO Function Ontology

3. Colocalizes with
• transiently or peripherally associated with an organelle or complex
• used with GO Component Ontology

Electronic Annotations

Fatty acid biosynthesis (Swiss-Prot Keyword)
  → GO:Fatty acid biosynthesis (GO:0006633)

EC:6.4.1.2 (EC number)
  → GO:acetyl-CoA carboxylase activity (GO:0003989)

IPR000438: Acetyl-CoA carboxylase carboxyl transferase beta subunit (InterPro entry)
  → GO:acetyl-CoA carboxylase activity (GO:0003989)
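Electronic annotation can be sketched as a table lookup: each external identifier attached to a protein is translated to a GO term and recorded with the IEA evidence code. The three mappings below are the ones on this slide; the dictionary layout and function name are illustrative assumptions:

```python
# External identifier -> GO ID, per source. Entries taken from the slide.
MAPPINGS = {
    "Swiss-Prot keyword": {"Fatty acid biosynthesis": "GO:0006633"},
    "EC": {"6.4.1.2": "GO:0003989"},
    "InterPro": {"IPR000438": "GO:0003989"},
}


def electronic_annotations(features):
    """Translate (source, identifier) pairs into unique (GO ID, 'IEA') pairs.

    Identifiers without a mapping are skipped rather than guessed.
    """
    hits = set()
    for source, identifier in features:
        go_id = MAPPINGS.get(source, {}).get(identifier)
        if go_id:
            hits.add((go_id, "IEA"))
    return sorted(hits)


protein_features = [
    ("Swiss-Prot keyword", "Fatty acid biosynthesis"),
    ("EC", "6.4.1.2"),
    ("InterPro", "IPR000438"),
]
print(electronic_annotations(protein_features))
# [('GO:0003989', 'IEA'), ('GO:0006633', 'IEA')]
```

The EC number and the InterPro entry both lead to the same GO term, so the function term is reported once.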

Unknown vs. Unannotated

• “Unknown” is used when the curator has determined that there is no existing literature to support an annotation.
  – Biological process unknown GO:0000004
  – Molecular function unknown GO:0005554
  – Cellular component unknown GO:0008372
• NOT the same as having no annotation at all
  – No annotation means that no one has looked yet
