Extracting biological meaning from large gene lists with DAVID http://david.abcc.ncifcrf.gov/home.jsp Huang et al., Curr Protoc Bioinformatics (2009) Francesco Mattia Mancuso ([email protected]) Bioinformatics Core Facility Short Tutorial

Upload: francesco-mattia-mancuso

Post on 17-May-2015


DESCRIPTION

Short tutorial on how to use the web-based tool DAVID (Database for Annotation, Visualization and Integrated Discovery), http://david.abcc.ncifcrf.gov/. DAVID provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes.

TRANSCRIPT

Page 1: David

Extracting biological meaning from large gene lists with DAVID

http://david.abcc.ncifcrf.gov/home.jsp

Huang et al., Curr Protoc Bioinformatics (2009)

Francesco Mattia Mancuso ([email protected])

Bioinformatics Core Facility

Short Tutorial

Page 2: David

Introduction

Data analysis of gene/protein lists
• Downstream data analysis task to understand the biological meaning of the output gene lists.
• Challenging task

High-throughput technologies:
• High-throughput genomics
• Proteomics
• Expression microarray
• Promoter microarray
• ChIP-on-chip
• …

These technologies offer significant capabilities to study a large variety of biological mechanisms, including associations with diseases.

Their output is typically a large ‘interesting’ gene list (ranging in size from hundreds to thousands of genes) involved in the studied biological conditions.

Page 3: David

DAVID

• Database for Annotation, Visualization and Integrated Discovery
• Released in 2003 (Dennis et al., Genome Biol.; Hosack et al., Genome Biol.)
• Able to extract biological features/meaning associated with large gene lists
• Able to handle any type of gene list

Common strategy shared with other tools:
• To systematically map the interesting genes in a list to the associated biological annotation (e.g. Gene Ontology terms)
• To statistically highlight the most overrepresented biological annotation (enrichment)

Page 4: David

The Gene Ontology Project

• Gene Ontology: a collection of controlled vocabularies describing the biology of a gene product in any organism

• http://www.geneontology.org/

“In the context of knowledge sharing, I use the term ontology to mean a specification of a conceptualization. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. This definition is consistent with the usage of ontology as set-of-concept-definitions, but more general.”

T. R. Gruber. A translation approach to portable ontologies. Knowledge Acquisition, 5(2):199-220, 1993.

Page 5: David

Main objectives of the GO project
• Compile and provide GO terms;
• Use structured vocabularies in the annotation of gene products;
• Provide open access to the GO database and Web resource.

Independent sets of vocabularies
1. Molecular Function (MF): elemental activity or task performed, or potentially performed, by individual gene products (e.g. “DNA binding” and “catalytic activity”);
2. Cellular Component (CC): location of action for a gene product (e.g. “organelle membrane” and “cytoskeleton”);
3. Biological Process (BP): broad biological objective or goal in which a gene product participates (e.g. “DNA replication” and “response to stimulus”).

Page 6: David

• Each term has a tracked accession ID.
• The accession ID belongs with the definition: if the name of a term changes (e.g., from “chromatin” to “structural component of chromatin”) but its definition does not, the accession ID remains the same.

Directed acyclic graphs (DAGs)

Semantic relationships between parent and child terms:
• is_a: the child is a subclass of the parent (e.g. endonuclease activity is a subcategory of nuclease activity).
• part_of: the child is a component of the parent, such as a subprocess or physical part (e.g. nucleolus is part of nuclear lumen).
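The parent/child relationships above can be sketched as a small graph structure. This is only an illustration, not how GO or DAVID store terms: the function names are hypothetical, and the only edges included are the two examples given on the slide.

```python
from collections import defaultdict

# (child, relation, parent) triples. Only the two examples from the slide
# are listed; a real ontology would contain many thousands of such edges.
EDGES = [
    ("endonuclease activity", "is_a", "nuclease activity"),
    ("nucleolus", "part_of", "nuclear lumen"),
]


def build_parent_index(edges):
    """Map each child term to a list of (relation, parent) pairs."""
    parents = defaultdict(list)
    for child, relation, parent in edges:
        parents[child].append((relation, parent))
    return parents


def ancestors(term, parents):
    """Return every term reachable by following parent links from `term`.

    A term may have several parents, which is why the structure is a
    directed acyclic graph rather than a tree; the `seen` set ensures
    each ancestor is visited only once.
    """
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


index = build_parent_index(EDGES)
print(ancestors("endonuclease activity", index))
print(ancestors("nucleolus", index))
```

Keeping the relation type on each edge makes it possible to follow only is_a links, or only part_of links, if that distinction matters for the analysis.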

Page 7: David

GO Tools
http://www.geneontology.org/GO.tools.shtml

• Consortium Tools:
  – AmiGO
  – DAG-Edit
• Non-Consortium Tools:
  – Search and browse: GOFish, QuickGO, …
  – Annotation: Manatee, GeneTools, …
  – Gene expression: BiNGO, GeneMerge, GOArray, GO Term Finder, …
  – Others: Blast2GO, Generic GO term Mapper, GO SLIM Mapper, …

Page 8: David

Enrichment and p-values calculated with a hypergeometric distribution

N = all genes (universe)
M = all genes belonging to a pathway
n = your gene list
m = genes of your gene list that belong to the pathway

Other well-known statistical methods: χ2, Fisher’s exact test, binomial probability
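As a rough illustration of the calculation named on this slide, the following Python sketch computes the one-sided enrichment p-value P(X ≥ m) from the hypergeometric distribution, reading N, M, n and m as counts. The function name and the example numbers are hypothetical and are not taken from DAVID; since DAVID reports a modified Fisher exact p-value (the EASE Score), values produced by the tool need not match this sketch.

```python
from math import comb


def hypergeom_enrichment_pvalue(N, M, n, m):
    """One-sided hypergeometric p-value P(X >= m).

    N: number of genes in the universe
    M: number of genes in the universe that belong to the pathway
    n: number of genes in the submitted gene list
    m: number of genes in the list that belong to the pathway
    """
    if not (0 <= M <= N and 0 <= n <= N and 0 <= m <= min(M, n)):
        raise ValueError("counts are inconsistent with each other")
    # Number of ways to draw n genes containing exactly k pathway genes,
    # summed over k = m .. min(M, n), divided by all ways to draw n genes.
    # math.comb returns 0 when a draw is impossible, so no extra
    # lower bound on k is needed.
    favourable = sum(
        comb(M, k) * comb(N - M, n - k) for k in range(m, min(M, n) + 1)
    )
    return favourable / comb(N, n)


# Made-up numbers, purely for illustration.
p = hypergeom_enrichment_pvalue(N=20000, M=200, n=500, m=15)
print(f"p-value: {p:.3e}")
```

Exact integer arithmetic with `math.comb` avoids the rounding problems of multiplying many small probabilities, at the cost of speed for very large universes.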

Page 9: David

A 'good' gene list

1. Contains many important genes (marker genes), as expected;
2. Has a reasonable number of genes, ranging from hundreds to thousands (e.g., 100–2,000 genes), not extremely low or high;
3. Most of the genes significantly pass the statistical threshold;
4. A portion of the up- or down-regulated genes is involved in certain interesting biological processes, rather than being randomly spread throughout all possible biological processes;
5. Consistently contains more enriched biology than a random list in the same size range;
6. Is highly reproducible: a similar gene list is generated under the same conditions;
7. The high quality of the data can be confirmed by other independent experiments.

Page 10: David

DAVID homepage: http://david.abcc.ncifcrf.gov/home.jsp

Page 11: David

The wide-range collection of heterogeneous functional annotations in the DAVID Knowledgebase

Page 12: David

Analytic tools/modules in DAVID

Huang et al., Nature Protocols, 2009

Page 13: David

GENE LIST MANAGEMENT PANEL: SUBMIT AND MANAGE USER’S GENE LISTS

Page 14: David

Analytic tools/modules in DAVID

Huang et al., Nature Protocols, 2009

Page 15: David

GENE NAME BATCH VIEWER: EXPLORE GENE NAMES BASED ON USER’S GENE IDs

Page 16: David

ID CONVERSION TOOL: CONVERT USERS’ GENE IDs TO DIFFERENT TYPES

Page 17: David
Page 18: David

Exercise 1

Submit data and convert the IDs

Cicala, C. et al. HIV envelope induces a cascade of cell signals in non-proliferating target cells that favor virus replication. Proc. Natl. Acad. Sci. USA 99, 9380–9385 (2002).

“Freshly isolated peripheral blood mononuclear cells were treated with an HIV envelope protein (gp120) and genome-wide gene expression changes were observed using Affymetrix U95A microarray chips. The aim of the experiment was to investigate cellular responses to viral envelope protein infection, which may help in understanding the mechanisms for HIV replication in resting or sub-optimally activated peripheral blood mononuclear cells.”

DOWNLOAD THE DATASET FROM:

http://www.nature.com/nprot/journal/v4/n1/suppinfo/nprot.2008.211_S1.html Supplementary Data 2

Page 19: David

Analytic tools/modules in DAVID

Huang et al., Nature Protocols, 2009

Page 20: David

GENE FUNCTIONAL CLASSIFICATION TOOL: CLASSIFY USERS’ GENES INTO CO-FUNCTIONAL GENE GROUPS

Page 21: David
Page 22: David
Page 23: David

Analytic tools/modules in DAVID

Huang et al., Nature Protocols, 2009

Page 24: David

FUNCTIONAL ANNOTATION TOOL: IDENTIFY ENRICHED BIOLOGY WITHIN USERS’ GENE LISTS

Page 25: David
Page 26: David

Analytic tools/modules in DAVID

Huang et al., Nature Protocols, 2009

Page 27: David
Page 28: David

Functional Annotation Chart

Page 29: David

Analytic tools/modules in DAVID

Huang et al., Nature Protocols, 2009

Page 30: David

Functional Annotation Clustering

Page 31: David

Analytic tools/modules in DAVID

Huang et al., Nature Protocols, 2009

Page 32: David

Functional Annotation Table

No statistics are applied in this report.

Page 33: David

Attention!

DAVID enrichment analysis is more of an exploratory procedure than a pure statistical solution.

“The final interpretation and analytic result decisions (in terms of accepting the results that make sense biologically in the context of the study, or rejecting ones that do not) should be made by the biologists/analysts themselves, rather than by any of the tools.”

(Huang et al., 2009)

Page 34: David

Exercise 2

Play with the functional classification and annotation tools

- Using the previous list, try all the functional classification and functional annotation tools with different options.
- Compare your results with the results presented in Huang et al., Nature Protocols, 2009.

ANY QUESTIONS?

Page 35: David

Appendix

Page 36: David

• Count Threshold (Minimum Count): the threshold of minimum gene counts belonging to an annotation term. It must be greater than or equal to 0. Default is 2. In short, you do not trust a term that has only one gene involved.

• EASE Score Threshold (Maximum Probability): the threshold of the EASE Score, a modified Fisher Exact P-value, for gene-enrichment analysis. It ranges from 0 to 1. Fisher Exact P-value = 0 represents perfect enrichment.

• Fold Enrichment: defined as the ratio of the two proportions. For example, if 40/400 (i.e. 10%) of your input genes are involved in "kinase activity" and the background is 300/30000 genes (i.e. 1%) associated with "kinase activity", the fold enrichment is roughly 10% / 1% = 10.

• In the DAVID annotation system, the Fisher Exact test is adopted to measure gene enrichment in annotation terms. When members of two independent groups can fall into one of two mutually exclusive categories, the Fisher Exact test is used to determine whether the proportions of those falling into each category differ by group.

• Benjamini-Hochberg, Bonferroni and FDR (False Discovery Rate) are different 'standard' statistics for multiple-comparison correction. They adjust P-values to be more conservative in order to lower the family-wise false discovery rate.

• LT (list total): number of genes in your gene list mapped to any term in this ontology ("system")

• PH (population hits): number of genes with this GO term on the background list (the whole chip)

• PT (population total): number of genes on the background list (the whole chip) mapped to any term in this ontology ("system")
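To make two of the definitions above concrete, here is a minimal Python sketch. It reproduces the fold-enrichment arithmetic from the "kinase activity" example and shows textbook versions of the Bonferroni and Benjamini-Hochberg adjustments. The function names and the list of raw p-values are hypothetical, and DAVID's own implementation of these corrections may differ in detail.

```python
def fold_enrichment(list_hits, list_total, pop_hits, pop_total):
    """Ratio of the proportion in the gene list to the background proportion."""
    return (list_hits / list_total) / (pop_hits / pop_total)


def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Textbook Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value can be
    # capped by the adjusted values of all larger p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Numbers from the "kinase activity" example: 40/400 versus 300/30000.
print(fold_enrichment(40, 400, 300, 30000))

# Hypothetical raw p-values for four annotation terms.
raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

Bonferroni is the more conservative of the two, which is why the adjusted values it returns are never smaller than the Benjamini-Hochberg ones for the same input.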